Home Credit Group published a dataset on Kaggle with the objective of encouraging aspiring data scientists and Kagglers to address the problem of wrongful loan rejection. As part of the AML course, we selected this project to practice and demonstrate our machine learning skills.
Phase 2 of the HCDR project contained three main elements: feature engineering, building a preprocessing pipeline, and hyperparameter tuning. Our group used the supplementary files as well as the main dataset to engineer new features, splitting up the tasks so that each supplementary file was tackled separately. We then created a preprocessing pipeline to prepare the dataset for final hyperparameter tuning; it adds the new features, performs cleaning, standardization, and imputation, and applies dimensionality reduction. Once the final dataset was prepared, we experimented with hyperparameter tuning and class-balancing methods: we tuned hyperparameters for XGBoost and LightGBM models and tested both manual and SMOTE balancing techniques. Our best model was a LightGBM model using SMOTE balancing, which secured a Kaggle public score of 0.794 and a private score of 0.789.
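The core idea behind SMOTE, which we used for balancing, is to synthesize new minority-class points by interpolating between a minority sample and one of its nearest minority neighbors. A toy NumPy illustration of that idea (in the project itself a library implementation such as imbalanced-learn's SMOTE would be used; `smote_toy` is our own illustrative name):

```python
import numpy as np

def smote_toy(X_min, n_new, k=3, rng=None):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                 # pick a random minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]           # k nearest, excluding the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                           # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# four minority points on the unit square; synthetic points stay inside their convex hull
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_toy(X_min, n_new=6, rng=42)
print(X_new.shape)  # (6, 2)
```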
The seven data sources used for further analysis are: application_train/application_test, bureau, bureau_balance, previous_application, POS_CASH_balance, installments_payments, and credit_card_balance.
Phase 2 work was divided into three parts, all of which can be found in this notebook.
PART 1: EDA AND FEATURE DEVELOPMENT USING SUPPLEMENTARY FILES
At the end of Part 1 we export all processed supplementary files, which can be merged directly with the main application train or test dataset.
PART 2: BUILDING THE PRE-PROCESSING PIPELINE
At the end of pre-processing, the data is exported in a form that can be used directly by machine learning models.
PART 3: DEVELOPING MACHINE LEARNING MODELS
This part involves hyperparameter tuning as well as analysis on the unbalanced dataset, balanced datasets (manual as well as automatic), the scaled dataset, etc.
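The hyperparameter search in Part 3 can be sketched with scikit-learn's RandomizedSearchCV. This is a minimal stand-in sketch: the data is synthetic, the estimator is a GradientBoostingClassifier rather than the XGBoost/LightGBM models actually tuned, and the search space is deliberately tiny to keep the demo fast:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# small synthetic, imbalanced stand-in for the prepared HCDR dataset
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0)

param_dist = {
    'n_estimators': randint(20, 80),
    'max_depth': randint(2, 4),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=4,            # real searches draw many more candidates
    scoring='roc_auc',   # AUC is the HCDR Kaggle metric
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```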
Common functions are created below so that we can reuse them across the given supplemental files.
# read list of input files from given path
#path = "/content/drive/My Drive/I526 Final Project/data/"
path = "../Project/data/raw/"
def read_files(list_of_files, path, print_details=True):
    df_list = []
    for file in list_of_files:
        chunksize = 500000
        filename = path + file
        i = 1
        for chunk in tqdm(pd.read_csv(filename, chunksize=chunksize, low_memory=False)):
            df = chunk if i == 1 else pd.concat([df, chunk])
            if print_details:
                print('-->Read Chunk...', i)
            i += 1
        df_list.append(df)
        print(file + " ... Read Completed")
        print('*' * 40)
    return df_list
# return the list of columns from the dataframe whose share of missing values exceeds the given threshold
def missingdata(data, missing_threshold):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum() / data.isnull().count() * 100).sort_values(ascending=False)
    ms = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    ms = ms[ms["Percent"] > missing_threshold]
    print('\n\nNumber of columns with over', missing_threshold, 'percent missing values \n-------------------')
    print(len(ms))
    f, ax = plt.subplots(figsize=(15, 10))
    plt.xticks(rotation=90)
    sns.barplot(x=ms.index, y=ms["Percent"], color="red", alpha=0.8)
    plt.xlabel('Column Name', fontsize=15)
    plt.ylabel('Percent of missing values', fontsize=15)
    plt.title('Percent missing data by Columns', fontsize=15)
    return ms
def one_hot_encoder(df, nan_as_category=True, columns=None):
    original_columns = list(df.columns)
    if columns is None:
        categorical_columns = [col for col in df.columns if df[col].dtype == 'object']
    else:
        categorical_columns = columns
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    return df, new_columns
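The `dummy_na` behavior that one_hot_encoder relies on can be sanity-checked directly with pandas on a toy frame (the `grade`/`amount` column names here are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'grade': ['A', 'B', None], 'amount': [100, 200, 300]})
# dummy_na=True adds an extra indicator column for missing values
encoded = pd.get_dummies(toy, columns=['grade'], dummy_na=True)
new_cols = [c for c in encoded.columns if c not in toy.columns]
print(new_cols)  # ['grade_A', 'grade_B', 'grade_nan']
```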
Read files
print(os.listdir("../Project/data/raw/"))
# reduce memory usage of the given dataframe by downcasting each column to the smallest type that fits
def reduce_memory(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('memory usage is', round(start_mem), 'MB')
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # note: float16 keeps only ~3 significant digits, which is acceptable for EDA
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('end memory usage is', round(end_mem), 'MB')
    return df
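The downcasting idea can be checked in isolation on a toy frame; the same np.iinfo bounds test that drives reduce_memory is shown here as a standalone sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'small_int': np.arange(1000, dtype=np.int64)})
before = df.memory_usage().sum()

col = df['small_int']
# values 0..999 fit comfortably inside int16's range, so the downcast applies
if col.min() > np.iinfo(np.int16).min and col.max() < np.iinfo(np.int16).max:
    df['small_int'] = col.astype(np.int16)

after = df.memory_usage().sum()
print(df['small_int'].dtype, before, '->', after)
```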
The bureau data comes in two files, bureau.csv and bureau_balance.csv. bureau.csv has one row for each credit the applicant previously took with other financial institutions, as reported to the credit bureau. bureau_balance.csv has the monthly balance history of those credits, one row per credit per month.
app_train,bureau, bureau_balance = read_files(['application_train.csv','bureau.csv', 'bureau_balance.csv'], path, print_details=False)
bureau.shape
bureau_balance.shape
bureau.dtypes.value_counts()
bureau.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
bureau.select_dtypes('int').apply(pd.Series.nunique, axis = 0)
missingdata(bureau,1)
plt.figure(figsize=(15,5))
ax=sns.countplot(x='CREDIT_ACTIVE', data=bureau, order=bureau['CREDIT_ACTIVE'].value_counts(normalize=True).index);
plt.title('CREDIT_ACTIVE');
plt.xticks(rotation=90);
plt.figure(figsize=(15,5))
ax=sns.countplot(x='CREDIT_TYPE', data=bureau, order=bureau['CREDIT_TYPE'].value_counts(normalize=True).index);
plt.title('CREDIT_TYPE');
plt.xticks(rotation=90);
columns = bureau.columns[bureau.dtypes == 'object']
for c in columns:
display((bureau[c].value_counts(normalize=True)*100).round(2),(bureau[c].value_counts().round(2)))
print('*'*40)
columns = bureau.columns[bureau.dtypes == 'float']
columns.tolist()
corr_matrix = bureau.corr()
corr_matrix
plt.subplots(figsize = (15,15))
sns.heatmap(bureau.corr(), cmap = 'viridis')
df_bureau_corr = bureau[['DAYS_CREDIT_ENDDATE',
'DAYS_ENDDATE_FACT',
'AMT_CREDIT_MAX_OVERDUE',
'AMT_CREDIT_SUM',
'AMT_CREDIT_SUM_DEBT',
'AMT_CREDIT_SUM_LIMIT',
'AMT_CREDIT_SUM_OVERDUE',
'AMT_ANNUITY'
]].copy()
# Calculate correlations
corr = df_bureau_corr.corr().abs()
# Heatmap
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, linewidths=.2, cmap="icefire");
Bureau data is evaluated with respect to several possible feature categories.
bureau.head()
bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM_DEBT'] / bureau['AMT_CREDIT_SUM']
bureau.head()
# columns = bureau.columns[bureau.dtypes == 'object']
# rows_to_use =1
# plots_in_each_columns = int(len(columns)/rows_to_use)
# fig, axs = plt.subplots(rows_to_use, plots_in_each_columns, figsize=(20,8))
# for i,c in enumerate(columns):
# sns.countplot(bureau[c], ax = axs[i])
sns.heatmap(bureau.isna(), cmap = 'viridis')
From the plots above we can see that most credits are either Active or Closed, that most clients use currency 1, and that the majority of loans were taken either as consumer credit or for credit card repayment.
app_train_merged = app_train.merge(bureau.groupby('SK_ID_CURR').mean().reset_index(),
left_on='SK_ID_CURR', right_on='SK_ID_CURR',
how='left', validate='one_to_one')
sns.distplot(app_train_merged[app_train_merged['TARGET'] == 0]['DAYS_CREDIT'], label='TARGET 0', bins=50, color='b')
sns.distplot(app_train_merged[app_train_merged['TARGET'] == 1]['DAYS_CREDIT'], label='TARGET 1', bins=50, color='r')
plt.legend()
plt.show()
def overdue(x):
    if x < 30:
        return 'A'
    elif x < 60:
        return 'B'
    elif x < 90:
        return 'C'
    elif x < 180:
        return 'D'
    elif x < 365:
        return 'E'
    else:
        return 'F'
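Applying the overdue bucketing to a few sample day counts shows how days-overdue values map to coarse risk grades (the function is repeated here so the cell runs standalone):

```python
def overdue(x):
    # bucket days-overdue into coarse risk grades A (best) to F (worst)
    if x < 30:
        return 'A'
    elif x < 60:
        return 'B'
    elif x < 90:
        return 'C'
    elif x < 180:
        return 'D'
    elif x < 365:
        return 'E'
    else:
        return 'F'

samples = [0, 45, 90, 200, 400]
grades = [overdue(d) for d in samples]
print(grades)  # ['A', 'B', 'D', 'E', 'F']
```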
fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.kdeplot(app_train_merged[(app_train_merged['TARGET'] == 0) & (app_train_merged['CREDIT_DAY_OVERDUE'] > 30)]['CREDIT_DAY_OVERDUE'].dropna(), label='TARGET 0', color='b')
sns.kdeplot(app_train_merged[(app_train_merged['TARGET'] == 1) & (app_train_merged['CREDIT_DAY_OVERDUE'] > 30)]['CREDIT_DAY_OVERDUE'].dropna(), label='TARGET 1', color='r')
plt.legend()
plt.show()
Function to generate the bureau features to be merged with the application train data
def bureau_data_generator(bureau, bureau_balance):
    #bureau, bureau_balance = read_files(['bureau.csv', 'bureau_balance.csv'], path, print_details=False)
    # STATUS: 'C' (closed) -> -1, 'X' (unknown) -> 0; keep the numeric DPD buckets '0'-'5'
    # (replace, not map: map would turn every unmapped digit value into NaN)
    bureau_balance['STATUS'] = bureau_balance['STATUS'].replace({'C': -1, 'X': 0})
    bureau_balance['STATUS'] = pd.to_numeric(bureau_balance['STATUS'], errors='coerce')
    bb_aggregations = {'MONTHS_BALANCE': ['min', 'max', 'size'],
                       'STATUS': ['min', 'max', 'mean']}
    bb_agg = bureau_balance.groupby('SK_ID_BUREAU').agg(bb_aggregations)
    bb_agg.columns = [e[0] + '_' + e[1].upper() for e in bb_agg.columns.to_list()]
    bb_agg.reset_index(inplace=True)
    bureau = bureau.merge(bb_agg, on='SK_ID_BUREAU', how='left')
    # feature engineering for bureau data
    rare_categories = ['Microloan', 'Loan for business development',
                       'Another type of loan', 'Unknown type of loan',
                       'Loan for working capital replenishment',
                       'Cash loan (non-earmarked)', 'Real estate loan',
                       'Loan for the purchase of equipment',
                       'Loan for purchase of shares (margin lending)',
                       'Interbank credit', 'Mobile operator loan']
    bureau['CREDIT_ACTIVE_Binary'] = bureau['CREDIT_ACTIVE'].apply(lambda x: 0 if x == 'Closed' else 1)
    bureau['CREDIT_ENDDATE_Binary'] = bureau['DAYS_CREDIT_ENDDATE'].apply(lambda x: 1 if x < 0 else 0)
    bureau['CREDIT_DURATION'] = -bureau['DAYS_CREDIT'] + bureau['DAYS_CREDIT_ENDDATE']
    bureau['ENDDATE_DIF'] = bureau['DAYS_CREDIT_ENDDATE'] - bureau['DAYS_ENDDATE_FACT']
    bureau['DEBT_PERCENTAGE'] = bureau['AMT_CREDIT_SUM_DEBT'] / bureau['AMT_CREDIT_SUM']
    bureau['CREDIT_TYPE_CATEGORY'] = bureau['CREDIT_TYPE'].apply(lambda x: 'RARE' if x in rare_categories else x)
    bureau, bureau_cat = one_hot_encoder(bureau, nan_as_category=True, columns=['CREDIT_TYPE_CATEGORY'])
    cat_aggregations = {}
    for cat in bureau_cat:
        cat_aggregations[cat] = ['mean', 'sum']
    num_aggregations = {
        'SK_ID_BUREAU': ['count'],
        'CREDIT_TYPE': ['nunique'],
        'CREDIT_ACTIVE_Binary': ['mean'],
        'CREDIT_ENDDATE_Binary': ['mean'],
        'DAYS_CREDIT': ['mean', 'var'],
        'DAYS_CREDIT_UPDATE': ['mean'],
        'CREDIT_DAY_OVERDUE': ['mean'],
        'AMT_CREDIT_MAX_OVERDUE': ['mean'],
        'AMT_CREDIT_SUM': ['mean', 'sum'],
        'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
        'AMT_CREDIT_SUM_OVERDUE': ['mean'],
        'AMT_CREDIT_SUM_LIMIT': ['mean', 'sum'],
        'AMT_ANNUITY': ['max', 'mean'],
        'CNT_CREDIT_PROLONG': ['sum'],
        'MONTHS_BALANCE_MIN': ['min'],
        'MONTHS_BALANCE_MAX': ['max'],
        'MONTHS_BALANCE_SIZE': ['mean', 'sum'],
        'STATUS_MIN': ['min'],
        'STATUS_MAX': ['max'],
        'STATUS_MEAN': ['mean'],
        'DEBT_PERCENTAGE': ['mean'],
        'CREDIT_DURATION': ['mean', 'sum'],
        'ENDDATE_DIF': ['mean', 'sum']
    }
    bureau_agg = bureau.groupby('SK_ID_CURR').agg({**num_aggregations, **cat_aggregations})
    bureau_agg.columns = [e[0] + '_' + e[1].upper() for e in bureau_agg.columns.to_list()]
    bureau_agg.columns = ['LOAN_COUNT', 'LOAN_TYPES', 'ACTIVE_LOANS_PERCENTAGE', 'CREDIT_ENDDATE_PERCENTAGE',
                          'AVG_DAYS_FROM_LAST_APPLICATION', 'VARIANCE_DAYS_FROM_LAST_APPLICATION', 'DAYS_CREDIT_UPDATE_MEAN',
                          'AVG_OVERDUE_DAYS', 'AVG_CREDIT_OVERDUE', 'AVG_CREDIT_AMT', 'TOTAL_CREDIT_AMT', 'AVG_CREDIT_DEBT',
                          'TOTAL_CREDIT_DEBT', 'AVG_CURRENT_AMT_DUE', 'AVG_CREDIT_CARD_LIMIT', 'TOTAL_CREDIT_CARD_LIMIT',
                          'MAX_ANNUITY', 'AVG_ANNUITY', 'TOTAL_PROLONGED_CREDIT', 'MONTHS_BALANCE_MIN', 'MONTHS_BALANCE_MAX',
                          'MONTHS_BALANCE_SIZE_MEAN', 'MONTHS_BALANCE_SIZE_SUM', 'STATUS_MIN', 'STATUS_MAX', 'STATUS_MEAN',
                          'DEBT_PERCENTAGE_MEAN', 'CREDIT_DURATION_MEAN', 'CREDIT_DURATION_SUM', 'ENDDATE_DIF_MEAN',
                          'ENDDATE_DIF_SUM',
                          'AVG_COUNT_CAR_LOAN', 'TOT_COUNT_CAR_LOAN',
                          'AVG_COUNT_CONSUMER_LOAN', 'TOT_COUNT_CONSUMER_LOAN',
                          'AVG_COUNT_CREDIT_CARD_LOAN', 'TOT_COUNT_CREDIT_CARD_LOAN',
                          'AVG_COUNT_MORTGAGE', 'TOT_COUNT_MORTGAGE',
                          'AVG_COUNT_RARE_LOAN', 'TOT_COUNT_RARE_LOAN',
                          'AVG_COUNT_NAN_LOAN', 'TOT_COUNT_NAN_LOAN']
    bureau_agg.columns = ['BUREAU_' + e for e in bureau_agg.columns.to_list()]
    bureau_agg.reset_index(inplace=True)
    return bureau_agg
df = bureau_data_generator(bureau,bureau_balance)
df.head()
#df.to_csv("/content/drive/My Drive/I526 Final Project/data/processed/bureau_processed.csv", index = False)
df.to_csv("../Project/data/processed/bureau_processed.csv", index = False)
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
app_train_bureau_merged = app_train_i.merge(df.reset_index(),
left_on='SK_ID_CURR', right_on='SK_ID_CURR',
how='left', validate='one_to_one')
corr_matrix = app_train_bureau_merged.corr()
corr_matrix["TARGET"].sort_values(ascending=False)
app_train_bureau_merged.head()
app_train_numeric = app_train_bureau_merged[ app_train_bureau_merged.dtypes[app_train_bureau_merged.dtypes == 'float64'].index]
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:5]
app_train_numeric = app_train_numeric[numeric_index]
app_train_numeric['TARGET'] = app_train_bureau_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
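The groupby/sample idiom used above downsamples every class to the size of the smallest one, so the pairplot is not dominated by TARGET 0. The same trick in isolation, on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'TARGET': [0] * 8 + [1] * 2, 'x': range(10)})
g = toy.groupby('TARGET')
# sample g.size().min() rows from each group -> equal class counts
balanced = g.apply(lambda s: s.sample(g.size().min(), random_state=0).reset_index(drop=True))
print(balanced['TARGET'].value_counts().to_dict())
```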
app_train_ic = app_train_bureau_merged[['TARGET', 'BUREAU_AVG_DAYS_FROM_LAST_APPLICATION', 'BUREAU_MONTHS_BALANCE_MIN']]
fig,axs = plt.subplots(2,2,figsize = (15,15))
# plt.figure(figsize=(12,5))
# plt.title("Distribution of AMT_CREDIT")
sns.distplot(app_train_ic["BUREAU_AVG_DAYS_FROM_LAST_APPLICATION"], ax=axs[0, 0])
axs[0, 0].set_title('Avg days diff from last application')
axs[0, 0].set_xlabel('Days diff from last app')
sns.boxplot(y=app_train_ic["BUREAU_AVG_DAYS_FROM_LAST_APPLICATION"], x=app_train_ic['TARGET'], ax=axs[0, 1])
axs[0, 1].set_title('Distribution of days diff from last app - Box Plot')
sns.distplot(app_train_ic["BUREAU_MONTHS_BALANCE_MIN"], ax=axs[1, 0])
axs[1, 0].set_title('Distribution of Month Min Balance')
axs[1, 0].set_xlabel('Month Min Balance')
sns.boxplot(y=app_train_ic["BUREAU_MONTHS_BALANCE_MIN"], x=app_train_ic['TARGET'], ax=axs[1, 1])
axs[1, 1].set_title('Distribution of Month Min Balance - Box Plot')
The previous_application file has one record for each previous Home Credit application of a given applicant in the application_train data. There is a one-to-many relationship between application_train and previous_application, with SK_ID_CURR as the primary key in application_train and a foreign key in previous_application; SK_ID_PREV is the primary key of previous_application.
The previous application dataset has information about the parameters of previous loans and the applicant's details at the time of those loans.
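Because of this one-to-many relationship, previous_application rows must be aggregated to one row per SK_ID_CURR before merging onto the application table. A minimal sketch with toy data (the values and the `PREV_*` output names are hypothetical):

```python
import pandas as pd

# toy previous_application: several rows per applicant
prev = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'SK_ID_PREV': [10, 11, 12],
    'AMT_CREDIT': [1000.0, 2000.0, 500.0],
})
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

# collapse to one row per applicant, then the merge is one-to-one
agg = prev.groupby('SK_ID_CURR').agg(
    PREV_APP_COUNT=('SK_ID_PREV', 'count'),
    PREV_AMT_CREDIT_MEAN=('AMT_CREDIT', 'mean'),
).reset_index()

merged = app.merge(agg, on='SK_ID_CURR', how='left', validate='one_to_one')
print(merged)
```

Applicant 3 has no previous applications, so its aggregated columns come back as NaN after the left merge.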
file_list = ['application_train.csv','previous_application.csv']
#app_test, app_train, installment, bureau_balance, pos_cash, bureau, prev_app, cc_bal = read_files(file_list, path, print_details=False)
app_train,prev_app= read_files(file_list, path, print_details=False)
prev_app.head(10)
prev_app.shape
prev_app.dtypes.value_counts()
prev_app.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
prev_app.select_dtypes('int').apply(pd.Series.nunique, axis = 0)
prev_app_counts = prev_app.groupby('SK_ID_CURR', as_index=False)['SK_ID_PREV'].count()
prev_app_counts.head(10)
missingdata(prev_app,20)
cat_list = prev_app.select_dtypes('object').columns
cat_tar_list = cat_list.to_list()
#cat_tar_list.append('TARGET')
cat_tar_list
len(cat_tar_list)
fig, ax = plt.subplots(4, 4, figsize=(40, 40))
plt.subplots_adjust(left=None, bottom=None, right=None,
                    top=None, wspace=None, hspace=0.45)
num = 0
for i in range(0, 4):
    for j in range(0, 4):
        tst = sns.countplot(x=cat_tar_list[num],
                            data=prev_app, ax=ax[i][j])
        tst.set_title(f"Distribution by {cat_tar_list[num]} Variable")
        tst.set_xticklabels(tst.get_xticklabels(), rotation=25)
        num = num + 1
columns = prev_app.columns[prev_app.dtypes == 'float64']
columns.tolist()
corr_matrix = prev_app.corr()
corr_matrix
plt.subplots(figsize = (15,15))
sns.heatmap(prev_app.corr(), cmap = 'viridis')
df_prev_app_corr = prev_app[['AMT_ANNUITY',
'AMT_APPLICATION',
'AMT_CREDIT',
'AMT_DOWN_PAYMENT',
'AMT_GOODS_PRICE',
'RATE_DOWN_PAYMENT',
'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED',
'CNT_PAYMENT'
]].copy()
# Calculate correlations
corr = df_prev_app_corr.corr().abs()
# Heatmap
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, linewidths=.2, cmap="icefire");
def cat_columns_value_dist(df, column_name):
    print("Column:", column_name)
    print(pd.DataFrame({'counts': df[column_name].value_counts(),
                        'pct': round(df[column_name].value_counts() / len(df[column_name]), 2)}))
cat_columns_value_dist(prev_app,'NAME_CONTRACT_TYPE')
cat_columns_value_dist(prev_app,'NAME_CASH_LOAN_PURPOSE')
cat_columns_value_dist(prev_app,'NAME_PAYMENT_TYPE')
cat_columns_value_dist(prev_app,'NAME_TYPE_SUITE')
cat_columns_value_dist(prev_app,'NAME_GOODS_CATEGORY')
cat_columns_value_dist(prev_app,'NAME_PORTFOLIO')
cat_columns_value_dist(prev_app,'NAME_PRODUCT_TYPE')
cat_columns_value_dist(prev_app,'PRODUCT_COMBINATION')
# explore the collapsed categories on a copy, so the raw prev_app stays intact for the feature function below
prev_app_expl = prev_app.copy()
prev_app_expl['NAME_CONTRACT_TYPE_CAT'] = prev_app_expl['NAME_CONTRACT_TYPE'].apply(lambda x: 'CASH_LOAD' if x == 'Cash loans' else ('CONSUMER_LOAN' if x == 'Consumer loans' else 'OTHER'))
prev_app_expl['NAME_CASH_LOAN_PURPOSE_CAT'] = prev_app_expl['NAME_CASH_LOAN_PURPOSE'].apply(lambda x: 'XAP' if x == 'XAP' else ('XNA' if x == 'XNA' else 'OTHER'))
prev_app_expl['NAME_PAYMENT_TYPE_CAT'] = prev_app_expl['NAME_PAYMENT_TYPE'].apply(lambda x: 'CASH_THROUGH_BANK' if x == 'Cash through the bank' else ('XNA' if x == 'XNA' else 'OTHER'))
prev_app_expl['NAME_TYPE_SUITE_CAT'] = prev_app_expl['NAME_TYPE_SUITE'].apply(lambda x: 'Unaccompanied' if x == 'Unaccompanied' else 'accompanied')
prev_app_expl['NAME_GOODS_CATEGORY_CAT'] = prev_app_expl['NAME_GOODS_CATEGORY'].apply(lambda x: 'XNA' if x == 'XNA' else ('Mobile' if x == 'Mobile' else 'OTHER'))
prev_app_expl['NAME_PORTFOLIO_CAT'] = prev_app_expl['NAME_PORTFOLIO'].apply(lambda x: 'POS' if x == 'POS' else ('Cash' if x == 'Cash' else ('XNA' if x == 'XNA' else 'OTHER')))
prev_app_expl['NAME_PRODUCT_TYPE_CAT'] = prev_app_expl['NAME_PRODUCT_TYPE'].apply(lambda x: 'XNA' if x == 'XNA' else ('XSELL' if x == 'x-sell' else 'OTHER'))
prev_app_expl, prev_app_cat = one_hot_encoder(prev_app_expl, nan_as_category=False,
                                              columns=['NAME_CONTRACT_TYPE_CAT', 'NAME_CASH_LOAN_PURPOSE_CAT',
                                                       'NAME_PAYMENT_TYPE_CAT', 'NAME_TYPE_SUITE_CAT',
                                                       'NAME_GOODS_CATEGORY_CAT', 'NAME_PORTFOLIO_CAT',
                                                       'NAME_PRODUCT_TYPE_CAT', 'NAME_CONTRACT_STATUS'])
prev_app_expl.head()
def prevapp_features(prev_app):
    prev_app['NAME_CONTRACT_TYPE_CAT'] = prev_app['NAME_CONTRACT_TYPE'].apply(lambda x: 'CASH_LOAD' if x == 'Cash loans' else ('CONSUMER_LOAN' if x == 'Consumer loans' else 'OTHER'))
    prev_app['NAME_CASH_LOAN_PURPOSE_CAT'] = prev_app['NAME_CASH_LOAN_PURPOSE'].apply(lambda x: 'XAP' if x == 'XAP' else ('XNA' if x == 'XNA' else 'OTHER'))
    prev_app['NAME_PAYMENT_TYPE_CAT'] = prev_app['NAME_PAYMENT_TYPE'].apply(lambda x: 'CASH_THROUGH_BANK' if x == 'Cash through the bank' else ('XNA' if x == 'XNA' else 'OTHER'))
    prev_app['NAME_TYPE_SUITE_CAT'] = prev_app['NAME_TYPE_SUITE'].apply(lambda x: 'Unaccompanied' if x == 'Unaccompanied' else 'accompanied')
    prev_app['NAME_GOODS_CATEGORY_CAT'] = prev_app['NAME_GOODS_CATEGORY'].apply(lambda x: 'XNA' if x == 'XNA' else ('Mobile' if x == 'Mobile' else 'OTHER'))
    prev_app['NAME_PORTFOLIO_CAT'] = prev_app['NAME_PORTFOLIO'].apply(lambda x: 'POS' if x == 'POS' else ('Cash' if x == 'Cash' else ('XNA' if x == 'XNA' else 'OTHER')))
    prev_app['NAME_PRODUCT_TYPE_CAT'] = prev_app['NAME_PRODUCT_TYPE'].apply(lambda x: 'XNA' if x == 'XNA' else ('XSELL' if x == 'x-sell' else 'OTHER'))
    prev_app, prev_app_cat = one_hot_encoder(prev_app, nan_as_category=False,
                                             columns=['NAME_CONTRACT_TYPE_CAT', 'NAME_CASH_LOAN_PURPOSE_CAT',
                                                      'NAME_PAYMENT_TYPE_CAT', 'NAME_TYPE_SUITE_CAT',
                                                      'NAME_GOODS_CATEGORY_CAT', 'NAME_PORTFOLIO_CAT',
                                                      'NAME_PRODUCT_TYPE_CAT', 'NAME_CONTRACT_STATUS'])
    num_aggregations = {
        'SK_ID_PREV': ['count'],
        'NAME_CONTRACT_TYPE': ['nunique'],
        'AMT_ANNUITY': ['mean', 'sum'],
        'AMT_APPLICATION': ['mean', 'sum'],
        'AMT_CREDIT': ['mean', 'sum'],
        'AMT_DOWN_PAYMENT': ['mean', 'sum'],
        'AMT_GOODS_PRICE': ['mean', 'sum'],
        'CNT_PAYMENT': ['mean', 'sum', 'min', 'max'],
        'NAME_CONTRACT_STATUS_Approved': ['sum'],
        'NAME_CONTRACT_STATUS_Canceled': ['sum'],
        'NAME_CONTRACT_STATUS_Refused': ['sum'],
        'NAME_CONTRACT_STATUS_Unused offer': ['sum'],
        'NAME_CONTRACT_TYPE_CAT_CASH_LOAD': ['sum'],
        'NAME_CONTRACT_TYPE_CAT_CONSUMER_LOAN': ['sum'],
        'NAME_CONTRACT_TYPE_CAT_OTHER': ['sum'],
        'NAME_CASH_LOAN_PURPOSE_CAT_OTHER': ['sum'],
        'NAME_CASH_LOAN_PURPOSE_CAT_XAP': ['sum'],
        'NAME_CASH_LOAN_PURPOSE_CAT_XNA': ['sum'],
        'NAME_PAYMENT_TYPE_CAT_CASH_THROUGH_BANK': ['sum'],
        'NAME_PAYMENT_TYPE_CAT_OTHER': ['sum'],
        'NAME_PAYMENT_TYPE_CAT_XNA': ['sum'],
        'NAME_TYPE_SUITE_CAT_Unaccompanied': ['sum'],
        'NAME_TYPE_SUITE_CAT_accompanied': ['sum'],
        'NAME_GOODS_CATEGORY_CAT_Mobile': ['sum'],
        'NAME_GOODS_CATEGORY_CAT_OTHER': ['sum'],
        'NAME_GOODS_CATEGORY_CAT_XNA': ['sum'],
        'NAME_PORTFOLIO_CAT_Cash': ['sum'],
        'NAME_PORTFOLIO_CAT_OTHER': ['sum'],
        'NAME_PORTFOLIO_CAT_POS': ['sum'],
        'NAME_PORTFOLIO_CAT_XNA': ['sum'],
        'NAME_PRODUCT_TYPE_CAT_OTHER': ['sum'],
        'NAME_PRODUCT_TYPE_CAT_XNA': ['sum'],
        'NAME_PRODUCT_TYPE_CAT_XSELL': ['sum']
    }
    prev_app_agg = prev_app.groupby('SK_ID_CURR').agg(num_aggregations)
    prev_app_agg.columns = [e[0] + '_' + e[1].upper() for e in prev_app_agg.columns.to_list()]
    # share of previous applications that ended up refused
    prev_app_agg['NAME_CONTRACT_STATUS_REFUSED_RATIO'] = prev_app_agg['NAME_CONTRACT_STATUS_Refused_SUM'] / (
        prev_app_agg['NAME_CONTRACT_STATUS_Approved_SUM']
        + prev_app_agg['NAME_CONTRACT_STATUS_Canceled_SUM']
        + prev_app_agg['NAME_CONTRACT_STATUS_Refused_SUM']
        + prev_app_agg['NAME_CONTRACT_STATUS_Unused offer_SUM'])
    prev_app_agg.columns = ['PREV_APP_' + e for e in prev_app_agg.columns.to_list()]
    prev_app_agg.reset_index(inplace=True)
    prev_app_agg.to_csv("../Project/data/processed/previous_application_processed.csv", index=False)
    return prev_app_agg
df_prev_app=prevapp_features(prev_app)
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
app_train_prev_app_merged = app_train_i.merge(df_prev_app.reset_index(),
left_on='SK_ID_CURR', right_on='SK_ID_CURR',
how='left', validate='one_to_one')
corr_matrix = app_train_prev_app_merged.corr()
corr_matrix["TARGET"].sort_values(ascending=False)
app_train_prev_app_merged.head()
app_train_numeric = app_train_prev_app_merged[ app_train_prev_app_merged.dtypes[app_train_prev_app_merged.dtypes == 'float64'].index]
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:10]
app_train_numeric = app_train_numeric[numeric_index]
app_train_numeric['TARGET'] = app_train_prev_app_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
The installments_payments dataset has the repayment history for credits previously disbursed by Home Credit that relate to the loans in the application train dataset. Installment data is useful because it shows the applicant's historical payment behavior.
One row corresponds to one payment of one installment; the data links to previous_application through SK_ID_PREV as a foreign key.
file_list = ['application_test.csv','installments_payments.csv']
#app_test, app_train, installment, bureau_balance, pos_cash, bureau, prev_app, cc_bal = read_files(file_list, path, print_details=False)
app_test,inst_paym= read_files(file_list, path, print_details=False)
inst_paym = reduce_memory(inst_paym)
inst_paym.shape
inst_paym.dtypes.value_counts()
inst_paym.select_dtypes('int').apply(pd.Series.nunique, axis = 0)
inst_paym.describe()
missingdata(inst_paym,-1)
corr_matrix = inst_paym.corr()
corr_matrix
plt.subplots(figsize = (15,15))
sns.heatmap(inst_paym.corr(), cmap = 'viridis')
inst_paym.columns.tolist()
df_inst_paym_corr = inst_paym[['DAYS_INSTALMENT',
'DAYS_ENTRY_PAYMENT',
'AMT_INSTALMENT',
'AMT_PAYMENT'
]].copy()
# Calculate correlations
corr = df_inst_paym_corr.corr().abs()
# Heatmap
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, linewidths=.2, cmap="icefire");
Some of the feature categories to evaluate on the installments data:
Late payments: days difference between the actual vs expected date of payment
# explore on a copy, so the raw inst_paym stays intact for the feature function below
inst_paym_expl = inst_paym.copy()
inst_paym_expl['INSTALMENT_ACTUAL_DAYS_DIFF'] = inst_paym_expl['DAYS_ENTRY_PAYMENT'] - inst_paym_expl['DAYS_INSTALMENT']
inst_paym_expl['LATE_INSTAL_Binary'] = inst_paym_expl['INSTALMENT_ACTUAL_DAYS_DIFF'].apply(lambda x: 0 if x <= 0 else 1)
Partial payments: amount difference between the actual vs expected installment amount
inst_paym_expl['AMT_INSTAL_ACTUAL_DIFF'] = inst_paym_expl['AMT_INSTALMENT'] - inst_paym_expl['AMT_PAYMENT']
inst_paym_expl['AMT_INSTAL_ACTUAL_DIFF_Binary'] = inst_paym_expl['AMT_INSTAL_ACTUAL_DIFF'].apply(lambda x: 0 if x <= 0 else 1)
# OHE for late/partial payment
inst_paym_expl, inst_paym_cat = one_hot_encoder(inst_paym_expl, nan_as_category=False, columns=['LATE_INSTAL_Binary', 'AMT_INSTAL_ACTUAL_DIFF_Binary'])
def installment_features(inst_paym):
    inst_paym['INSTALMENT_ACTUAL_DAYS_DIFF'] = inst_paym['DAYS_ENTRY_PAYMENT'] - inst_paym['DAYS_INSTALMENT']
    inst_paym['LATE_INSTAL_Binary'] = inst_paym['INSTALMENT_ACTUAL_DAYS_DIFF'].apply(lambda x: 0 if x <= 0 else 1)
    inst_paym['AMT_INSTAL_ACTUAL_DIFF'] = inst_paym['AMT_INSTALMENT'] - inst_paym['AMT_PAYMENT']
    inst_paym['AMT_INSTAL_ACTUAL_DIFF_Binary'] = inst_paym['AMT_INSTAL_ACTUAL_DIFF'].apply(lambda x: 0 if x <= 0 else 1)
    inst_paym, inst_paym_cat = one_hot_encoder(inst_paym, nan_as_category=False,
                                               columns=['LATE_INSTAL_Binary', 'AMT_INSTAL_ACTUAL_DIFF_Binary'])
    num_aggregations = {
        'SK_ID_PREV': ['count'],
        'AMT_INSTALMENT': ['mean', 'sum'],
        'AMT_PAYMENT': ['mean', 'sum'],
        'INSTALMENT_ACTUAL_DAYS_DIFF': ['mean', 'sum'],
        'AMT_INSTAL_ACTUAL_DIFF': ['mean', 'sum'],
        'LATE_INSTAL_Binary_0': ['sum'],
        'LATE_INSTAL_Binary_1': ['sum'],
        'AMT_INSTAL_ACTUAL_DIFF_Binary_0': ['sum'],
        'AMT_INSTAL_ACTUAL_DIFF_Binary_1': ['sum']
    }
    inst_paym_agg = inst_paym.groupby('SK_ID_CURR').agg(num_aggregations)
    inst_paym_agg.columns = [e[0] + '_' + e[1].upper() for e in inst_paym_agg.columns.to_list()]
    # ratios of late and of partial installments
    # (trailing underscores kept so the names match the columns used in later cells)
    inst_paym_agg['LATE_INSTAL_RATIO_'] = inst_paym_agg['LATE_INSTAL_Binary_1_SUM'] / (
        inst_paym_agg['LATE_INSTAL_Binary_1_SUM'] + inst_paym_agg['LATE_INSTAL_Binary_0_SUM'])
    inst_paym_agg['AMT_PARTIAL_INSTAL_RATIO_'] = inst_paym_agg['AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM'] / (
        inst_paym_agg['AMT_INSTAL_ACTUAL_DIFF_Binary_0_SUM'] + inst_paym_agg['AMT_INSTAL_ACTUAL_DIFF_Binary_1_SUM'])
    inst_paym_agg.columns = ['INST_PAY_' + e for e in inst_paym_agg.columns.to_list()]
    inst_paym_agg.reset_index(inplace=True)
    inst_paym_agg.to_csv("../Project/data/processed/installments_payments_processed.csv", index=False)
    return inst_paym_agg
inst_paym_agg = installment_features(inst_paym)
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
app_train_inst_merged = app_train_i.merge(inst_paym_agg.reset_index(),
left_on='SK_ID_CURR', right_on='SK_ID_CURR',
how='left', validate='one_to_one')
corr_matrix = app_train_inst_merged.corr()
corr_matrix["TARGET"].sort_values(ascending=False)
app_train_inst_merged.head()
app_train_numeric = app_train_inst_merged[ app_train_inst_merged.dtypes[app_train_inst_merged.dtypes == 'float64'].index]
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:5]
app_train_numeric = app_train_numeric[numeric_index]
app_train_numeric['TARGET'] = app_train_inst_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
fig,axs = plt.subplots(1,2,figsize = (20,8))
sns.distplot(app_train_inst_merged['INST_PAY_LATE_INSTAL_RATIO_'], color='b', bins=20, kde=False, ax=axs[0])
axs[0].set_title('INST_PAY_LATE_INSTAL_RATIO_'); axs[0].set_xlabel('INST_PAY_LATE_INSTAL_RATIO_'); axs[0].set_ylabel('Count');
sns.distplot(app_train_inst_merged[app_train_inst_merged['TARGET'] == 0]['INST_PAY_LATE_INSTAL_RATIO_'], color='b', bins=10, hist=False, label='TARGET 0', ax=axs[1])
sns.distplot(app_train_inst_merged[app_train_inst_merged['TARGET'] == 1]['INST_PAY_LATE_INSTAL_RATIO_'], color='r', bins=10, hist=False, label='TARGET 1', ax=axs[1])
axs[1].set_title('INST_PAY_LATE_INSTAL_RATIO_ of applicants'); axs[1].set_xlabel('INST_PAY_LATE_INSTAL_RATIO_'); axs[1].set_ylabel('Density');
plt.legend()
The credit_card_balance file has monthly balance snapshots of previous credit cards that the applicant held with Home Credit.
## Reading the application train, test, and credit card balance files.
file_list = ['application_test.csv','application_train.csv', 'credit_card_balance.csv']
#app_test, app_train, installment, bureau_balance, pos_cash, bureau, prev_app, cc_bal = read_files(file_list, path, print_details=False)
app_test, app_train, cc = read_files(file_list, path, print_details=False)
cc.head()
cc.shape
cc.dtypes.value_counts()
cc.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
cc.select_dtypes('int').apply(pd.Series.nunique, axis = 0)
cc.describe()
missingdata(cc,0)
plt.figure(figsize=(15,5))
ax=sns.countplot(x='NAME_CONTRACT_STATUS', data=cc, order=cc['NAME_CONTRACT_STATUS'].value_counts(normalize=True).index);
plt.title('NAME_CONTRACT_STATUS');
plt.xticks(rotation=90);
corr_matrix = cc.corr()
corr_matrix
plt.subplots(figsize = (15,15))
sns.heatmap(cc.corr(), cmap = 'viridis')
df_cc_corr = cc[[ 'MONTHS_BALANCE',
'AMT_BALANCE',
'AMT_CREDIT_LIMIT_ACTUAL',
'AMT_DRAWINGS_CURRENT',
'AMT_PAYMENT_CURRENT',
'AMT_PAYMENT_TOTAL_CURRENT',
'AMT_RECEIVABLE_PRINCIPAL',
'AMT_RECIVABLE',
'AMT_TOTAL_RECEIVABLE'
]].copy()
# Calculate correlations
corr = df_cc_corr.corr().abs()
# Heatmap
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True, linewidths=.2, cmap="icefire");
A few feature categories which we can evaluate on the credit card balance data:
No. of loans
C = cc.groupby(by=['SK_ID_CURR'])['SK_ID_PREV'].nunique().reset_index().rename(index=str, columns={'SK_ID_PREV': 'LOAN_COUNTS'})
display(C['LOAN_COUNTS'].value_counts())
C.head()
No. of installments
grp = cc.groupby(by=['SK_ID_CURR', 'SK_ID_PREV'])['CNT_INSTALMENT_MATURE_CUM'].max().reset_index().rename(index=str, columns={'CNT_INSTALMENT_MATURE_CUM': 'NO_INSTALMENTS'})
grp1 = grp.groupby(by=['SK_ID_CURR'])['NO_INSTALMENTS'].sum().reset_index().rename(index=str, columns={'NO_INSTALMENTS': 'TOTAL_INSTALMENTS'})
grp1.head(10)
C = C.merge(grp1, on = 'SK_ID_CURR', how = 'left')
C.head()
C.groupby('SK_ID_CURR')['TOTAL_INSTALMENTS'].agg(['mean', 'sum', 'max'])
grp = cc.groupby(by=['SK_ID_CURR'])['AMT_DRAWINGS_ATM_CURRENT'].sum().reset_index()
grp.head()
C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()
grp = cc.groupby(by=['SK_ID_CURR'])['AMT_DRAWINGS_CURRENT'].sum().reset_index()
grp.head()
C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()
grp = cc.groupby(by=['SK_ID_CURR'])['CNT_DRAWINGS_CURRENT'].sum().reset_index()
grp.head()
C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()
Max Credit Limit
max_cr_limit = cc.groupby('sk_id_curr')[['amt_credit_limit_actual']].max().reset_index()
max_cr_limit.columns = ['sk_id_curr','max_cr_lim']
max_cr_limit.head()
Credit Limit Growth
cr_limit_growth = ((cc.groupby('sk_id_curr')['amt_credit_limit_actual'].max() - cc.groupby('sk_id_curr')['amt_credit_limit_actual'].min())
/ cc.groupby('sk_id_curr')['amt_credit_limit_actual'].min()).replace(np.inf, np.nan)
cr_limit_growth.head()
num_months = cc.groupby('sk_id_curr')['months_balance'].count()
num_months.head()
cr_lim_growth_norm = (cr_limit_growth / num_months).reset_index()
cr_lim_growth_norm.columns = ['sk_id_curr', 'cr_lim_growth_norm']
cr_lim_growth_norm.head()
Average payment relative to the minimum average installment
grp = cc.groupby(['SK_ID_CURR', 'SK_ID_PREV'])[['AMT_PAYMENT_TOTAL_CURRENT',
    'AMT_INST_MIN_REGULARITY']].mean().reset_index().groupby(
    'SK_ID_CURR')[['AMT_PAYMENT_TOTAL_CURRENT', 'AMT_INST_MIN_REGULARITY']].mean().reset_index()
grp['CC_MIN_PAYMENT_RATIO'] = grp['AMT_PAYMENT_TOTAL_CURRENT'] / grp['AMT_INST_MIN_REGULARITY']
grp.head()
avg_payment_norm = (cc.groupby('sk_id_curr')['amt_payment_total_current'].mean()/ cc.groupby('sk_id_curr')['amt_inst_min_regularity'].mean()).reset_index()
avg_payment_norm.columns = ['sk_id_curr', 'avg_payment_norm']
avg_payment_norm
Payment above minimum
cc['paymentdiff'] = cc.amt_payment_total_current - cc.amt_inst_min_regularity
pay_diff = cc[['sk_id_curr','paymentdiff']]
avg_pay_diff = pay_diff.groupby('sk_id_curr')['paymentdiff'].mean().reset_index()
avg_pay_diff.columns = ['sk_id_curr', 'avg_pay_diff']
avg_pay_diff.head()
med_pay_diff = pay_diff.groupby('sk_id_curr')['paymentdiff'].median().reset_index()
med_pay_diff.columns = ['sk_id_curr', 'med_pay_diff']
med_pay_diff.head()
Combine columns
feng_df = max_cr_limit.merge(cr_lim_growth_norm, how = 'left', on = 'sk_id_curr')
feng_df = feng_df.merge(avg_pay_diff, how = 'left', on = 'sk_id_curr')
feng_df = feng_df.merge(med_pay_diff, how = 'left', on = 'sk_id_curr')
feng_df.head()
(feng_df.sk_id_curr.value_counts() > 1).sum()
Contract Status
contract = cc_bal[['SK_ID_CURR', 'NAME_CONTRACT_STATUS']]
contract = pd.get_dummies(contract, columns= ['NAME_CONTRACT_STATUS'] )
contract.head()
drops = ['NAME_CONTRACT_STATUS_Approved', 'NAME_CONTRACT_STATUS_Demand',
'NAME_CONTRACT_STATUS_Refused', 'NAME_CONTRACT_STATUS_Sent proposal',
'NAME_CONTRACT_STATUS_Signed' ]
contract = contract.drop(drops, axis=1)
Credit Load
grp = cc_bal.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])[['AMT_CREDIT_LIMIT_ACTUAL', 'AMT_BALANCE']].max().reset_index()
grp['balance_to_limit_ratio'] = grp['AMT_BALANCE'] / grp['AMT_CREDIT_LIMIT_ACTUAL']
display(grp.head(10))
grp2 = grp.groupby('SK_ID_CURR')['balance_to_limit_ratio'].mean().reset_index()
display(grp2.head(10))
C = C.merge(grp2, on = 'SK_ID_CURR', how = 'left')
C.head()
Past Due
grp = cc_bal.groupby('SK_ID_CURR')['SK_DPD'].sum().reset_index()
display(grp.head())
C = C.merge(grp, on = 'SK_ID_CURR', how = 'left')
C.head()
def f(min_pay, total_pay):
M = min_pay.tolist()
T = total_pay.tolist()
P = len(M)
c = 0
# Find the count of transactions when Payment made is less than Minimum Payment
for i in range(len(M)):
if T[i] < M[i]:
c += 1
return 100*c/P
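As a quick, self-contained sanity check of the helper above (the function is redefined inside the snippet so it runs on its own), a toy example:

```python
import pandas as pd

def f(min_pay, total_pay):
    # Percentage of months in which the payment made fell below the minimum payment
    M = min_pay.tolist()
    T = total_pay.tolist()
    P = len(M)
    c = 0
    for i in range(len(M)):
        if T[i] < M[i]:
            c += 1
    return 100 * c / P

min_pay = pd.Series([100.0, 100.0, 100.0, 100.0])
total_pay = pd.Series([150.0, 90.0, 100.0, 80.0])
print(f(min_pay, total_pay))  # 50.0 -> the payment missed the minimum in 2 of 4 months
```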
def credit_data_generator():
cc = read_files(['credit_card_balance.csv'], path, print_details=False)[0]
cc['NAME_CONTRACT_STATUS'] = cc['NAME_CONTRACT_STATUS'].apply(
lambda x : 1 if x == 'Active' else 0)
num_aggregations = {
'SK_ID_PREV' : ['nunique'],
'MONTHS_BALANCE':["sum","mean"],
'AMT_BALANCE':["sum","mean","min","max"],
'AMT_CREDIT_LIMIT_ACTUAL':["sum","mean"],
'AMT_DRAWINGS_ATM_CURRENT':["sum","mean","min","max"],
'AMT_DRAWINGS_CURRENT':["sum","mean","min","max"],
'AMT_DRAWINGS_OTHER_CURRENT':["sum","mean","min","max"],
'AMT_DRAWINGS_POS_CURRENT':["sum","mean","min","max"],
'AMT_INST_MIN_REGULARITY':["sum","mean","min","max"],
'AMT_PAYMENT_CURRENT':["sum","mean","min","max"],
'AMT_PAYMENT_TOTAL_CURRENT':["sum","mean","min","max"],
'AMT_RECEIVABLE_PRINCIPAL':["sum","mean","min","max"],
'AMT_RECIVABLE':["sum","mean","min","max"],
'AMT_TOTAL_RECEIVABLE':["sum","mean","min","max"],
'CNT_DRAWINGS_ATM_CURRENT':["sum","mean"],
'CNT_DRAWINGS_CURRENT':["sum","mean","max"],
'CNT_DRAWINGS_OTHER_CURRENT':["mean","max"],
'CNT_DRAWINGS_POS_CURRENT':["sum","mean","max"],
'CNT_INSTALMENT_MATURE_CUM':["sum","mean","max","min"],
'SK_DPD':["sum","mean","max"],
'SK_DPD_DEF':["sum","mean","max"],
'NAME_CONTRACT_STATUS':["sum","mean","min","max"],
# 'TOTAL_INSTALMENTS':["mean"],
}
cc_agg = cc.groupby('SK_ID_CURR').agg({**num_aggregations})
cc_agg.columns = ['CC_' + e[0] + '_' + e[1].upper() for e in cc_agg.columns.to_list()]
cc_agg.reset_index(inplace=True)
grp = cc.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])['CNT_INSTALMENT_MATURE_CUM'].max().reset_index().rename(index = str, columns = {'CNT_INSTALMENT_MATURE_CUM': 'NO_INSTALMENTS'})
grp1 = grp.groupby(by = ['SK_ID_CURR'])['NO_INSTALMENTS'].sum().reset_index().rename(index = str, columns = {'NO_INSTALMENTS': 'CC_TOTAL_INSTALMENTS'})
cc_agg = cc_agg.merge(grp1, on = 'SK_ID_CURR', how ='left')
    grp = cc.groupby(by = ['SK_ID_CURR', 'SK_ID_PREV'])[['AMT_CREDIT_LIMIT_ACTUAL', 'AMT_BALANCE']].max().reset_index()
grp['balance_to_limit_ratio'] = grp['AMT_BALANCE'] / grp['AMT_CREDIT_LIMIT_ACTUAL']
grp2 = grp.groupby('SK_ID_CURR')['balance_to_limit_ratio'].mean().reset_index().rename(index = str, columns = {'balance_to_limit_ratio': 'CC_CREDIT_LOAD'})
cc_agg = cc_agg.merge(grp2, on = 'SK_ID_CURR', how ='left')
grp = cc.groupby(by = ['SK_ID_CURR']).apply(lambda x: f(x.AMT_INST_MIN_REGULARITY, x.AMT_PAYMENT_CURRENT)).reset_index().rename(
index = str, columns = { 0 : 'CC_PERCENTAGE_MISSED_PAYMENTS'})
cc_agg = cc_agg.merge(grp, on = 'SK_ID_CURR', how ='left')
    grp = cc.groupby(['SK_ID_CURR', 'SK_ID_PREV'])[['AMT_PAYMENT_TOTAL_CURRENT',
        'AMT_INST_MIN_REGULARITY']].mean().reset_index().groupby(
        'SK_ID_CURR')[['AMT_PAYMENT_TOTAL_CURRENT', 'AMT_INST_MIN_REGULARITY']].mean().reset_index()
grp['CC_MIN_PAYMENT_RATIO'] = grp['AMT_PAYMENT_TOTAL_CURRENT'] / grp['AMT_INST_MIN_REGULARITY']
cc_agg = cc_agg.merge(grp, on = 'SK_ID_CURR', how ='left')
return cc_agg
cc_grouped = credit_data_generator()
#cc_grouped.to_csv('/content/drive/My Drive/I526 Final Project/data/processed/cc_processed.csv', index=False)
cc_grouped.to_csv("../Project/data/processed/cc_processed.csv", index = False)
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
app_train_cc_merged = app_train_i.merge(cc_grouped.reset_index(),
left_on='SK_ID_CURR', right_on='SK_ID_CURR',
validate='one_to_one')
corr_matrix = app_train_cc_merged.corr()
corr_matrix["TARGET"].sort_values(ascending=False)
app_train_cc_merged.head()
app_train_numeric = app_train_cc_merged[['CC_CNT_DRAWINGS_ATM_CURRENT_MEAN', 'CC_CNT_DRAWINGS_CURRENT_MAX','CC_CREDIT_LOAD','CC_AMT_BALANCE_MEAN','CC_AMT_TOTAL_RECEIVABLE_MEAN','AMT_INST_MIN_REGULARITY']]
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:5]
app_train_numeric = app_train_numeric[numeric_index]
app_train_numeric['TARGET'] = app_train_cc_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
file_list = [ 'application_train.csv', 'POS_CASH_balance.csv']
app_train, pos_cash = read_files(file_list, path, print_details=False)
pos_cash=reduce_memory(pos_cash)
pos_cash.info()
# Check for null values
missingdata(pos_cash,0)
pos_cash.head()
pos_cash.describe()
CNT_INSTALMENT: Term of previous credit (can change over time)
pos_cash['CNT_INSTALMENT'].median()
pos_cash['CNT_INSTALMENT'].plot.hist()
plt.title("Distribution of CNT_Instalment")
CNT_INSTALMENT_FUTURE: Installments left to pay on the previous credit
pos_cash['CNT_INSTALMENT_FUTURE'].plot.hist()
plt.title("Distribution of CNT_INSTALMENT_FUTURE")
pos_cash['CNT_INSTALMENT_FUTURE'].median()
Months Balance: Month of balance relative to application date
(-1 refers to the freshest monthly snapshot; 0 refers to the information at the time of application. Often 0 will be the same as -1, as many banks do not update the information with the Credit Bureau regularly.)
pos_cash['MONTHS_BALANCE'].plot.hist()
plt.title("Distribution of MONTHS_BALANCE")
Overall view of Data Distribution
corr_matrix = pos_cash.corr()
corr_matrix
plt.subplots(figsize = (15,15))
sns.heatmap(pos_cash.corr(), cmap = 'viridis')
In this section, we look at ways to combine or eliminate features that could be added to the application train dataset for modeling.
A few questions we can evaluate:
pos_cash[(pos_cash['SK_ID_CURR'] == 100001) ].sort_values(by=['SK_ID_PREV', 'MONTHS_BALANCE'])
pos_cash['contract_status_binary'] = pos_cash['NAME_CONTRACT_STATUS'].map({'Active' : 0, 'Completed' : 1})
loan_status = pos_cash.groupby(['SK_ID_CURR', 'SK_ID_PREV'])['contract_status_binary'].sum().reset_index()
loan_status = loan_status.groupby('SK_ID_CURR')['contract_status_binary'].mean().reset_index()
loan_status.head(15)
act_contract_counts = (pos_cash.groupby('SK_ID_CURR')['contract_status_binary'].sum() / pos_cash.groupby('SK_ID_CURR')['contract_status_binary'].count()).reset_index()
act_contract_counts.columns = ['SK_ID_CURR', 'ACT_CONTRACTS']
act_contract_counts.head()
sns.kdeplot(act_contract_counts['ACT_CONTRACTS'])
curt_instal_mont_ratio = pos_cash.groupby('SK_ID_CURR', as_index=False)[['CNT_INSTALMENT', 'MONTHS_BALANCE']].sum()
curt_instal_mont_ratio['Cur_Install_Mon_Ratio'] = curt_instal_mont_ratio['CNT_INSTALMENT'] / curt_instal_mont_ratio['MONTHS_BALANCE']
curt_instal_mont_ratio.drop(['CNT_INSTALMENT', 'MONTHS_BALANCE'], axis =1, inplace=True)
curt_instal_mont_ratio.head()
sns.kdeplot(curt_instal_mont_ratio['Cur_Install_Mon_Ratio'])
fut_instal_mont_ratio = pos_cash.groupby('SK_ID_CURR', as_index=False)[['CNT_INSTALMENT_FUTURE', 'MONTHS_BALANCE']].sum()
fut_instal_mont_ratio['Fut_Install_Mon_Ratio'] = fut_instal_mont_ratio['CNT_INSTALMENT_FUTURE'] / fut_instal_mont_ratio['MONTHS_BALANCE']
fut_instal_mont_ratio.drop(['CNT_INSTALMENT_FUTURE', 'MONTHS_BALANCE'], axis =1, inplace=True)
fut_instal_mont_ratio.head()
features = pd.DataFrame({'SK_ID_CURR': pos_cash['SK_ID_CURR'].unique()})
pos_cash_sorted = pos_cash.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])
group_object = pos_cash_sorted.groupby('SK_ID_CURR')['CNT_INSTALMENT_FUTURE'].last().reset_index()
group_object.rename(index=str,
columns={'CNT_INSTALMENT_FUTURE': 'REM_INSTALMENT'},
inplace=True)
group_object.head()
features[features['SK_ID_CURR'] == 100002]
pos_cash_sorted['PAID_LATE'] = (pos_cash_sorted['SK_DPD'] > 0).astype(int)
pos_cash_sorted['PAID_LATE_TOL'] = (pos_cash_sorted['SK_DPD_DEF'] > 0).astype(int)
grp_late = pos_cash_sorted.groupby(['SK_ID_CURR'])[['PAID_LATE', 'PAID_LATE_TOL']].sum().reset_index()
grp_late.head()
def pos_data_generator(pos_cash):
#pos_cash = read_files(['POS_CASH_balance.csv'], path, print_details=False)[0]
    # For every unique ID, summarize the contract status of its loans (Active = 0, Completed = 1)
pos_cash['contract_status_binary'] = pos_cash['NAME_CONTRACT_STATUS'].map({'Active' : 0, 'Completed' : 1})
pos = pos_cash.groupby(['SK_ID_CURR', 'SK_ID_PREV'])['contract_status_binary'].sum().reset_index()
pos = pos.groupby('SK_ID_CURR')['contract_status_binary'].mean().reset_index()
#pos = (pos_cash.groupby('SK_ID_CURR')['contract_status_binary'].sum() / pos_cash.groupby('SK_ID_CURR')['contract_status_binary'].count()).reset_index()
pos.columns = ['SK_ID_CURR', 'ACT_CONTRACTS']
    # Ratio of the current installment count to the number of months on book
    grp = pos_cash.groupby('SK_ID_CURR', as_index=False)[['CNT_INSTALMENT', 'MONTHS_BALANCE']].sum()
grp['Cur_Install_Mon_Ratio'] = grp['CNT_INSTALMENT'] / grp['MONTHS_BALANCE']
grp.drop(['CNT_INSTALMENT', 'MONTHS_BALANCE'], axis =1, inplace=True)
pos=pos.merge(grp,on='SK_ID_CURR',how='left')
    # Ratio of the future (remaining) installment count to the number of months on book
    grp = pos_cash.groupby('SK_ID_CURR', as_index=False)[['CNT_INSTALMENT_FUTURE', 'MONTHS_BALANCE']].sum()
grp['Fut_Install_Mon_Ratio'] = grp['CNT_INSTALMENT_FUTURE'] / grp['MONTHS_BALANCE']
grp.drop(['CNT_INSTALMENT_FUTURE', 'MONTHS_BALANCE'], axis =1, inplace=True)
pos=pos.merge(grp,on='SK_ID_CURR',how='left')
    # For each client, the number of installments still pending as of the latest snapshot
pos_cash_sorted = pos_cash.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])
grp = pos_cash_sorted.groupby('SK_ID_CURR')['CNT_INSTALMENT_FUTURE'].last().reset_index()
grp.rename(index=str,columns={'CNT_INSTALMENT_FUTURE': 'REM_INSTALMENT'},inplace=True)
pos=pos.merge(grp,on='SK_ID_CURR',how='left')
    # For each unique ID, the share of past months paid late (and late beyond tolerance)
    pos_cash_sorted['PAID_LATE'] = (pos_cash_sorted['SK_DPD'] > 0).astype(int)
    pos_cash_sorted['PAID_LATE_TOL'] = (pos_cash_sorted['SK_DPD_DEF'] > 0).astype(int)
    grp = pos_cash_sorted.groupby(['SK_ID_CURR'])[['PAID_LATE', 'PAID_LATE_TOL']].mean().reset_index()
pos=pos.merge(grp,on='SK_ID_CURR',how='left')
pos.columns = ['POS_' + e if e != 'SK_ID_CURR' else e for e in pos.columns.to_list()]
return pos
df_pos_cash=pos_data_generator(pos_cash)
df_pos_cash.head()
df_pos_cash.describe()
#df.to_csv('/content/drive/My Drive/I526 Final Project/data/processed/pos_processed.csv',index=False)
df_pos_cash.to_csv('../Project/data/processed//pos_processed.csv',index=False)
df_pos_cash.columns.tolist()
app_train_i=app_train[['SK_ID_CURR', 'TARGET']]
app_train_df_pos_cash_merged = app_train_i.merge(df_pos_cash.reset_index(),
left_on='SK_ID_CURR', right_on='SK_ID_CURR',
how='left', validate='one_to_one')
corr_matrix = app_train_df_pos_cash_merged.corr()
corr_matrix["TARGET"].sort_values(ascending=False)
app_train_df_pos_cash_merged.head()
app_train_numeric = app_train_df_pos_cash_merged
numeric_index = app_train_numeric.isna().sum()[app_train_numeric.isna().sum()/len(app_train_numeric) <0.5].index[:6]
app_train_numeric = app_train_numeric[numeric_index]
app_train_numeric['TARGET'] = app_train_df_pos_cash_merged['TARGET']
g = app_train_numeric.groupby('TARGET')
app_train_numeric = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
sns.pairplot(app_train_numeric, hue= 'TARGET')
A high-level view of the pipeline is shown in the image above. More details on the code and pre-processing steps can be found below. Preprocessed files are exported at the end of this pipeline; this pre-processed data is used for training the machine learning models in PART 3.
from google.colab import drive
drive.mount('/content/drive')
## Load the necessary packages for building pipelines
import json
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, roc_curve, auc, \
roc_auc_score, confusion_matrix, plot_roc_curve, plot_confusion_matrix
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
from tqdm import tqdm
from functools import reduce
# The function below reads data files in chunks and appends each resulting DataFrame to a list
path = "/content/drive/My Drive/I526 Final Project/data/"
def read_files(list_of_files, path, print_details = True):
df_list = []
for file in list_of_files:
chunksize = 500000
filename = path + file
i = 1
for chunk in tqdm(pd.read_csv(filename, chunksize=chunksize, low_memory=False)):
df = chunk if i == 1 else pd.concat([df, chunk])
if print_details:
print('-->Read Chunk...', i)
i += 1
df_list.append(df)
print(file + " ... Read Completed")
print('*'*40)
return df_list
## Reading Application train and test file.
file_list = ['application_test.csv','application_train.csv']
# Read the data into two DataFrames: application test and application train
app_test, app_train = read_files(file_list, path, print_details=False)
In order to reduce the runtime of hyperparameter tuning, we randomly sampled 50% of the Kaggle training dataset, stratified on the target variable. At the end, however, we trained on the full Kaggle training set to generate the final model.
# Separate the target from the training dataset
dfx = app_train.drop('TARGET', axis =1)
dfy = app_train['TARGET']
X_train, _, y_train, _ = train_test_split(dfx, dfy, train_size = 0.5, stratify = dfy, random_state = 32)
# Merge X_train and y_train back into one DataFrame
df_sample = pd.concat([X_train, y_train], axis =1).reset_index(drop= True)
df_sample.head()
Below are the customized pipeline classes developed to create new features, merge data from the supplementary files, and reduce dimensionality.
These feed into the estimator pipeline.
Upon completion of these steps, the data is exported as a clean CSV file, which is used to perform model evaluation.
There are six different sources, or supplementary files, provided within the Kaggle dataset.
We wanted to build on our baseline model by adding features from the supplementary files provided on Kaggle. For each supplementary file we conducted all the data pre-processing steps, including EDA, feature engineering, and data imputation. The results of feature engineering were written to separate *_processed.csv files. In the function below we merge all the processed CSV files into one for further analysis.
Processing of the bureau and bureau balance files was conducted by Jugal, the previous application and installments payments files by Deepak, the POS cash file by Gautham, and the credit card balance file by Andrew.
class Supp_Info_Adder(BaseEstimator, TransformerMixin):
def __init__(self): # no *args or **kargs
pass
def fit(self, X, y=None):
return self # nothing else to do
# All the processed files were merged together
def transform(self, X, y=None):
# df = X.copy()
dfs = read_files(['bureau_processed.csv', 'pos_processed.csv', 'cc_processed.csv',
'previous_application_processed.csv', 'installments_payments_processed.csv'],
'/content/drive/My Drive/I526 Final Project/data/processed/', print_details=False)
dfs.insert(0, X)
# Merge the data set by keeping the column SK_ID_CURR as primary key
merge_df = reduce(lambda left,right: pd.merge(left,right,on='SK_ID_CURR', how = 'left'), dfs)
        print('Supplementary files merged to the main train/test file............')
return merge_df
# merge_df is the overall dataset
supp = Supp_Info_Adder()
temp = supp.transform(df_sample)
As part of feature engineering, each team member created functions to extract the features developed during the data pre-processing steps. The class below, Feature_Adder, combines all of these engineered features.
class Feature_Adder(TransformerMixin, BaseEstimator):
'''A template for a custom transformer.'''
def __init__(self):
pass
def fit(self, X, y=None):
return self
def transform(self, X):
# X = X.copy()
X['DAYS_EMPLOYED'].replace(365243, np.nan, inplace= True)
X['CODE_GENDER'].replace({'XNA': np.nan}, inplace=True)
X['NAME_INCOME_TYPE'] = X['NAME_INCOME_TYPE'].map(lambda x: x if x != 'Maternity leave' else np.nan)
X['NAME_FAMILY_STATUS'] = X['NAME_FAMILY_STATUS'].map(lambda x: x if x != 'Unknown' else np.nan)
building_columns = ['APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
'NONLIVINGAREA_AVG']
flag_columns = [_f for _f in X.columns if 'FLAG_DOC' in _f]
live = [_f for _f in X.columns if ('FLAG_' in _f) & ('FLAG_DOC' not in _f) & ('_FLAG_' not in _f)]
drop_flag_columns = ['FLAG_DOCUMENT_2','FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_6','FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
'FLAG_DOCUMENT_9','FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_12','FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
'FLAG_DOCUMENT_15','FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_18','FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_21']
bureau_total_columns = ['AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR']
X['Annuity_Income'] = X['AMT_ANNUITY']/X['AMT_INCOME_TOTAL']
X['Income_Cred'] = X['AMT_CREDIT']/X['AMT_INCOME_TOTAL']
X['Income_PP'] = X['AMT_INCOME_TOTAL']/X['CNT_FAM_MEMBERS']
X['CHILDREN_RATIO'] = (1 + X['CNT_CHILDREN']) / X['CNT_FAM_MEMBERS']
X['INCOME_PER_CHILD'] = X['AMT_INCOME_TOTAL']/(1 + X['CNT_CHILDREN'])
X['PAYMENTS'] = X['AMT_ANNUITY']/ X['AMT_CREDIT']
X['NEW_CREDIT_TO_GOODS_RATIO'] = X['AMT_CREDIT'] / X['AMT_GOODS_PRICE']
X['GOODS_INCOME'] = X['AMT_GOODS_PRICE']/X['AMT_INCOME_TOTAL']
X['LOAN_ACESS'] = X['AMT_CREDIT'] - X['AMT_GOODS_PRICE']
X['INCOME_TO_EMPLOYED_RATIO'] = X['AMT_INCOME_TOTAL'] / X['DAYS_EMPLOYED']
X['INCOME_TO_BIRTH_RATIO'] = X['AMT_INCOME_TOTAL'] / X['DAYS_BIRTH']
X['CNT_NON_CHILD'] = X['CNT_FAM_MEMBERS'] - X['CNT_CHILDREN']
X['TERM'] = X['AMT_CREDIT'] / X['AMT_ANNUITY']
X['MEAN_BUILDING_SCORE_AVG'] = X[building_columns].mean(skipna=True, axis=1)
X['TOTAL_BUILDING_SCORE_AVG'] = X[building_columns].sum(skipna=True, axis=1)
X['NEW_DOC_TOTAL'] = X[flag_columns].sum(axis=1)
X['NEW_DOC_AVG'] = X[flag_columns].mean(axis=1)
X['NEW_DOC_STD'] = X[flag_columns].std(axis=1)
X['NEW_DOC_KURT'] = X[flag_columns].kurtosis(axis=1)
        X['NEW_LIVE_SUM'] = X[live].sum(axis=1)
X['NEW_LIVE_STD'] = X[live].std(axis=1)
X['NEW_LIVE_KURT'] = X[live].kurtosis(axis=1)
X['EXTERNAL_SOURCE_WEIGHTED'] = X.EXT_SOURCE_1 * 2 + X.EXT_SOURCE_2 * 3 + X.EXT_SOURCE_3 * 4
X['AMT_REQ_CREDIT_BUREAU_TOTAL'] = X[bureau_total_columns].sum(axis=1)
X['AGE_RANGE'] = X['DAYS_BIRTH'].apply(lambda x: self._get_age_label(x))
inc_by_org = X[['AMT_INCOME_TOTAL', 'ORGANIZATION_TYPE']].groupby('ORGANIZATION_TYPE').median()['AMT_INCOME_TOTAL']
X['NEW_INC_BY_ORG'] = X['ORGANIZATION_TYPE'].map(inc_by_org)
X['NEW_EXT_SOURCES_MEAN'] = X[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
X['OWN_CAR_AGE'].fillna(0, inplace = True)
# Bureau Columns missing data and new feature
X['Bureau_No_Record'] = X['BUREAU_LOAN_COUNT'].apply(lambda x : 1 if x != x else 0 )
b_columns = [_b for _b in X.columns if 'BUREAU_' in _b]
# X[b_columns].fillna(0, inplace = True)
X[b_columns] = X[b_columns].mask(X['Bureau_No_Record'] == 1, X[b_columns].fillna(0))
        # BUREAU_INCOME_CREDIT_RATIO : ratio of the average credit amount to the total household income
X['BUREAU_INCOME_CREDIT_RATIO'] = X['BUREAU_AVG_CREDIT_AMT'] / X['AMT_INCOME_TOTAL']
# Prev App Columns missing data and new feature
X['Prev_App_No_Record'] = X['PREV_APP_SK_ID_PREV_COUNT'].apply(lambda x : 1 if x != x else 0 )
pa_columns = [_pa for _pa in X.columns if 'PREV_APP_' in _pa]
# X[pa_columns].fillna(0, inplace = True)
X[pa_columns] = X[pa_columns].mask(X['Prev_App_No_Record'] == 1, X[pa_columns].fillna(0))
# Installment Pay Columns missing data and new feature
X['INST_No_Record'] = X['INST_PAY_SK_ID_PREV_COUNT'].apply(lambda x : 1 if x != x else 0 )
inst_columns = [_ins for _ins in X.columns if 'INST_PAY_' in _ins]
# X[in_columns].fillna(0, inplace = True)
X[inst_columns] = X[inst_columns].mask(X['INST_No_Record'] == 1, X[inst_columns].fillna(0))
# CreditCard Columns missing data and new feature
X['CC_No_Record'] = X['CC_SK_ID_PREV_NUNIQUE'].apply(lambda x : 1 if x != x else 0 )
cc_columns = [_cc for _cc in X.columns if 'CC_' in _cc]
# X[in_columns].fillna(0, inplace = True)
X[cc_columns] = X[cc_columns].mask(X['CC_No_Record'] == 1, X[cc_columns].fillna(0))
# POS CASH Columns missing data and new feature
X['POS_No_Record'] = X['POS_Cur_Install_Mon_Ratio'].apply(lambda x : 1 if x != x else 0 )
        pos_columns = [_p for _p in X.columns if 'POS_' in _p]
# X[in_columns].fillna(0, inplace = True)
X[pos_columns] = X[pos_columns].mask(X['POS_No_Record'] == 1, X[pos_columns].fillna(0))
        print("Feature Adder completed............")
return X
def _get_age_label(self, days_birth):
#Return the age group label (int).
age_years = -days_birth / 365
if age_years < 27: return 1
elif age_years < 40: return 2
elif age_years < 50: return 3
elif age_years < 65: return 4
elif age_years < 99: return 5
else: return 0
Correlation within the Added Features
After adding all features, their correlation with the target variable is analyzed. Some features, such as the external source scores and age range, are comparatively highly correlated with the target.
The correlation heatmap also shows that several features are highly correlated with one another; these features need to be dropped during the dimensionality reduction step.
#Check how correlated the new features are with respect to the TARGET
FEA_N.corr().abs().sort_values('TARGET', ascending=False)['TARGET']
#Lets try to look at this visually on heatmap
plt.subplots(figsize = (15,15))
sns.heatmap(FEA_N.corr().abs(), cmap = 'viridis')
# Based on the correlation between features, we reduce dimensionality by dropping features with a high pairwise correlation (r greater than 0.9)
While creating new features, we built several ratios intended to add meaningful and significant features to the overall dataset. During their creation we noticed some unusual and infinite values that needed to be removed; these arise mainly from zero or missing denominators in the original dataset.
The class below addresses this problem.
class Remove_abnormal(TransformerMixin, BaseEstimator):
'''A template for a custom transformer.'''
def __init__(self):
self.columns_to_drop = []
pass
def fit(self, X, y=None):
return self
def transform(self, X):
# transform X via code or additional methods
# X = X.copy()
X.replace(np.inf, np.nan, inplace = True)
X.replace(-np.inf, np.nan, inplace = True)
print("Abnormality removed............")
return X
We use three methods to reduce the number of features in the dataset: dropping features with more than 60% missing values, dropping features with very low variance, and dropping one of each pair of highly correlated features (r greater than 0.9).
class DimReduction(TransformerMixin, BaseEstimator):
'''A template for a custom transformer.'''
def __init__(self):
self.columns_to_drop = []
pass
def fit(self, X, y=None):
# Find feature where the missing values are more than 60%
missing_columns = X.columns[X.isna().mean() > 0.6].to_list()
# Find feature where variance is very low
columns_with_small_variance = list(X.std()[(X.std() < 0.01)].index)
# Remove Highly correlated features
corr_matrix = X.corr().abs()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
tri_df = corr_matrix.mask(mask)
high_correlated_columns = [c for c in tri_df.columns
if any(tri_df[c] > 0.9)]
self.columns_to_drop = list(set(missing_columns +
columns_with_small_variance +
high_correlated_columns))
self.columns_to_drop.append('SK_ID_CURR')
return self
def transform(self, X):
# transform X via code or additional methods
X = X.drop(self.columns_to_drop, axis=1)
print("Dimensionality reduced............")
return X
We further clean up and prepare the overall dataset with the following imputation: categorical and discrete numeric features are filled with the most frequent value, and continuous numeric features with the median.
class CustomImputer(TransformerMixin, BaseEstimator):
'''A template for a custom transformer.'''
def __init__(self):
self.cat = SimpleImputer(strategy='most_frequent')
self.disnum = SimpleImputer(strategy='most_frequent')
self.num = SimpleImputer(strategy='median')
# self.ohe = OneHotEncoder(handle_unknown='ignore'))
## Column Lists
self.cat_list = []
self.dis_num_list = []
self.num_list = []
def fit(self, X, y=None):
self.cat_list, self.dis_num_list, self.num_list = self._feature_type_split(X)
self.cat.fit(X[self.cat_list])
self.disnum.fit(X[self.dis_num_list])
self.num.fit(X[self.num_list])
return self
def transform(self, X):
# transform X via code or additional methods
# X = X.copy()
# Categorical List
X[self.cat_list] = self.cat.transform(X[self.cat_list])
# Distinct Numerical List
X[self.dis_num_list] = self.disnum.transform(X[self.dis_num_list])
        # Continuous numerical list
X[self.num_list] = self.num.transform(X[self.num_list])
print("Imputing completed............")
return X
def _feature_type_split(self, X):
cat_list = []
dis_num_list = []
num_list = []
df = X.copy()
# df= df.drop('SK_ID_CURR', axis =1)
for i in df.columns.tolist():
if i != 'TARGET':
if df[i].dtype == 'object':
cat_list.append(i)
elif df[i].nunique() < 25:
dis_num_list.append(i)
else:
num_list.append(i)
print('Columns split completed............')
return cat_list, dis_num_list, num_list
For categorical features we also performed one-hot encoding, so that the models can distinguish between categories and use them effectively.
Below is a class that performs the one-hot encoding on the dataset.
class Binary_Labelizer(TransformerMixin, BaseEstimator):
'''A template for a custom transformer.'''
def __init__(self):
self.columns_to_drop = []
pass
def fit(self, X, y=None):
return self
def transform(self, X):
X = pd.get_dummies(X, drop_first=True)
print("One Hot Encoding Performed............")
return X
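A caveat worth noting with `pd.get_dummies` applied to train and test separately, as this class does, is that the two frames can end up with different dummy columns when a category appears in only one of them. A common guard (shown here as a standalone sketch with made-up data, not part of the pipeline above) is to reindex the test frame to the training columns:

```python
import pandas as pd

train = pd.DataFrame({'NAME_TYPE': ['Cash', 'Revolving', 'Cash']})
test = pd.DataFrame({'NAME_TYPE': ['Cash', 'XNA']})  # 'XNA' never seen in train

train_ohe = pd.get_dummies(train, drop_first=True)
test_ohe = pd.get_dummies(test, drop_first=True)

# Reindex test to the training columns: dummies unseen in train are dropped,
# dummies missing from test are added as zeros
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(list(test_ohe.columns))  # ['NAME_TYPE_Revolving'], matching train_ohe.columns
```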
Now that all the above classes are defined, we put them together and test them in the pipeline below.
data_pipeline = Pipeline([('merger', Supp_Info_Adder()),
('feature_adder', Feature_Adder()),
('abnormal_data', Remove_abnormal()),
('dimensionality_reduction', DimReduction()),
('custom_imputer', CustomImputer()),
('ohe', Binary_Labelizer()),
])
We take the stratified training sample (50% of the data) and fit the pipeline on it; this sample is further used for hyperparameter tuning in Part 3.
result_train = data_pipeline.fit_transform(df_sample)
# Describe
print("Shape of final train file is : " , result_train.shape)
result_train.head()
Here we transform the test dataset using the parameters learned by the fit function on the training dataset.
result_test = data_pipeline.transform(app_test)
print("Shape of final test file is : " , result_test.shape)
result_test.head()
After fit and transform, we export these two datasets as CSV files and save them on the local drive.
result_train.to_csv(path + 'Final_Files/train_final.csv', index = False)
result_test.to_csv(path + 'Final_Files/test_final.csv', index = False)
Here we process the full Kaggle training dataset, which helps us in hyperparameter tuning and in developing a robust model for our Kaggle submission.
full_train = data_pipeline.fit_transform(app_train)
full_train.shape
#Export data into local drive for further model building
full_train.to_csv(path + 'Final_Files/Full_train.csv', index = False)
In this part we developed end-to-end pre-processing pipelines and transformed the datasets for further model building.
The final preprocessed files have 311 features: to the 121 features of the original application train dataset, we added 190 features for further analysis.
Part 3 includes further feature engineering and model development using this transformed dataset.
PART 3 of the Phase 2 project includes several model evaluations. During Phase 1 we tested Logistic Regression, Random Forest, and Gradient Boosting classifiers; based on the Phase 1 results, it was clear that gradient-boosted decision tree models were performing better than logistic regression or random forest.
The following case studies were evaluated for this phase. More than 25,000 experiments were performed using grid search across various pre-processing schemes and balancing ratios.
| Case Study | XGBoost Experiments | LightGBM Experiments | XGBoost Kaggle Score | LightGBM Kaggle Score |
|---|---|---|---|---|
| Unbalanced Dataset | 360 Experiments | 3240 Experiments | 0.792 | 0.787 |
| Manual Balanced Dataset (0.25 ratio) | 360 Experiments | 3240 Experiments | 0.779 | 0.790 |
| Manual Balanced Dataset (0.33 ratio) | 360 Experiments | 3240 Experiments | 0.790 | 0.791 |
| Manual Balanced Dataset (0.5 ratio) | 360 Experiments | 3240 Experiments | 0.783 | 0.789 |
| Numerically Scaled - Unbalanced Dataset | 360 Experiments | 3240 Experiments | 0.789 | 0.787 |
| PCA - Unbalanced Dataset | - | 180 Experiments | - | 0.600 |
| SMOTE - Synthetic Balance | 360 Experiments | 9720 Experiments | 0.773 | 0.794 |
Out of all case studies, the best model by Kaggle score is LightGBM on the synthetically balanced (SMOTE) dataset, with a score of 0.794. The second-best result comes from the unbalanced dataset: using XGBoost, its Kaggle score is 0.792.
For the complete experiment logs, refer to the Discussion of Results.
import json
from sklearn.model_selection import GridSearchCV, train_test_split, ParameterGrid
from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
# from catboost import CatBoostClassifier
from xgboost import XGBClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, f1_score, roc_curve, auc, \
roc_auc_score, confusion_matrix, plot_roc_curve, plot_confusion_matrix
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
from tqdm import tqdm
from time import time
from datetime import datetime
pd.set_option('display.max_columns', 500)
path = "data/final_processed/"
def read_files(list_of_files, path, print_details = True):
    df_list = []
    for file in list_of_files:
        chunksize = 500000
        filename = path + file
        chunks = []
        # Read the file in chunks, then concatenate once at the end
        # (repeatedly concatenating inside the loop is quadratic).
        for i, chunk in enumerate(tqdm(pd.read_csv(filename, chunksize=chunksize, low_memory=False)), start=1):
            chunks.append(chunk)
            if print_details:
                print('-->Read Chunk...', i)
        df_list.append(pd.concat(chunks))
        print(file + " ... Read Completed")
        print('*'*40)
    return df_list
## Reading the final pre-processed train and test files.
file_list = ['train_final.csv', 'test_final.csv']
#app_test, app_train, installment, bureau_balance, pos_cash, bureau, prev_app, cc_bal = read_files(file_list, path, print_details=False)
df_train, df_test = read_files(file_list, path, print_details=False)
The pre-processed files currently have 310 features. One additional column represents the TARGET variable.
print('Train Data Shape is :', df_train.shape)
print('Test Data Shape is :', df_test.shape)
df_train.head()
df_test.head()
The pre-processed files are large, which can slow down processing. The function below reduces the memory footprint of a DataFrame by downcasting numeric columns. Memory usage for the training set drops from 365MB to 88MB; similarly, the test set is reduced from 115MB to 28MB.
def reduce_memory(df):
start_mem = df.memory_usage().sum() / 1024**2
print('memory usage is ' , round(start_mem), 'MB')
for col in df.columns:
col_type = df[col].dtype
if col_type != object:
c_min = df[col].min()
c_max = df[col].max()
if str(col_type)[:3] == 'int':
if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
df[col] = df[col].astype(np.int8)
elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
df[col] = df[col].astype(np.int16)
elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
df[col] = df[col].astype(np.int32)
elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
df[col] = df[col].astype(np.int64)
else:
if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
df[col] = df[col].astype(np.float16)
elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
df[col] = df[col].astype(np.float32)
else:
df[col] = df[col].astype(np.float64)
else:
df[col] = df[col].astype('category')
end_mem = df.memory_usage().sum() / 1024**2
print('end memory usage is ' , round(end_mem), 'MB')
return df
df_train = reduce_memory(df_train)
df_test = reduce_memory(df_test)
During pre-processing we already reduced the dimensionality of the dataset, but it can be reduced further: we compute feature importances and remove every feature with zero importance.
To be conservative, we fitted two models, XGBoost and LightGBM, and removed only the features that had zero importance in both. In addition, each model was validated over two folds. This yields robust importance values, and we drop only the features that have zero importance everywhere.
The best-performing features are:
Term of Loan (engineered feature)
Mean of External Sources (engineered feature)
Length of Employment (given)
External Source Score (given)
Future monthly installment ratio (engineered feature)
Annuity Amount
Several features that look important at first sight were assigned zero importance by the models when predicting the outcome.
Some of the features given zero importance are:
Industry where people work
Requirement of certain documents
Whether the person is unemployed or a student (as long as they have income, it does not matter)
Whether the person is a first-time applicant
X = df_train.drop('TARGET', axis = 1).values
y = df_train['TARGET'].values
X_train, X_test, y_train, y_test = train_test_split(X, y , stratify = y,
test_size = 0.3, random_state = 42)
# Initialize an empty array to hold feature importances
## WE WILL DOUBLE FIT IN ORDER TO BE MORE CONSERVATIVE
models = [LGBMClassifier(objective='binary', boosting_type = 'goss', n_estimators = 10000, class_weight = 'balanced'),
XGBClassifier(objective='binary:logistic', n_estimators = 10000,)]
feature_importances = np.zeros(X_train.shape[1])
# Create the model with several hyperparameters
for model in models:
for i in range(2):
# Split into training and validation set
train_features, valid_features, train_y, valid_y = train_test_split(X_train, y_train, test_size = 0.25, random_state = i*42)
# Train using early stopping
model.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)],
eval_metric = 'auc', verbose = 200)
# Record the feature importances
feature_importances += model.feature_importances_
# Average over 2 models x 2 validation splits = 4 fits
feature_importances = feature_importances / 4
feature_importances = pd.DataFrame({'feature': list(df_train.drop('TARGET', axis=1).columns), 'importance': feature_importances}).sort_values('importance', ascending = False)
display(feature_importances.head())
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
display(feature_importances.tail(20))
def plot_feature_importances(df, threshold = 0.9):
"""
Plots 15 most important features and the cumulative importance of features.
Prints the number of features needed to reach threshold cumulative importance.
Parameters
--------
df : dataframe
Dataframe of feature importances. Columns must be feature and importance
threshold : float, default = 0.9
        Threshold for printing information about cumulative importances
Return
--------
df : dataframe
Dataframe ordered by feature importances with a normalized column (sums to 1)
and a cumulative importance column
"""
plt.rcParams['font.size'] = 18
# Sort features according to importance
df = df.sort_values('importance', ascending = False).reset_index()
# Normalize the feature importances to add up to one
df['importance_normalized'] = df['importance'] / df['importance'].sum()
df['cumulative_importance'] = np.cumsum(df['importance_normalized'])
# Make a horizontal bar chart of feature importances
plt.figure(figsize = (10, 6))
ax = plt.subplot()
# Need to reverse the index to plot most important on top
ax.barh(list(reversed(list(df.index[:15]))),
df['importance_normalized'].head(15),
align = 'center', edgecolor = 'k')
# Set the yticks and labels
ax.set_yticks(list(reversed(list(df.index[:15]))))
ax.set_yticklabels(df['feature'].head(15))
# Plot labeling
plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
plt.show()
# Cumulative importance plot
plt.figure(figsize = (8, 6))
plt.plot(list(range(len(df))), df['cumulative_importance'], 'r-')
plt.xlabel('Number of Features'); plt.ylabel('Cumulative Importance');
plt.title('Cumulative Feature Importance');
plt.show();
importance_index = np.min(np.where(df['cumulative_importance'] > threshold))
print('%d features required for %0.2f of cumulative importance' % (importance_index + 1, threshold))
return df
norm_feature_importances = plot_feature_importances(feature_importances)
df_train_reduced = df_train.drop(zero_features, axis=1)
X = df_train_reduced.drop('TARGET', axis = 1).values
y = df_train_reduced['TARGET'].values
X_train, X_test, y_train, y_test = train_test_split(X, y , stratify = y,
test_size = 0.3, random_state = 42)
X_train.shape
We will evaluate the models using grid search over several hyperparameters. We observed that XGBoost runs slower during grid search and takes more time, so the number of experiments was adjusted for each model based on time constraints.
The tree ensemble model consists of a set of classification and regression trees (CART). Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which sums the prediction of multiple trees together.
The prediction scores of each individual tree are summed up to get the final score.
Mathematically, we can write our model in the form
\begin{equation*} \hat y_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in F \end{equation*}
where $K$ is the number of trees, $f_k$ is a function in the functional space $F$, and $F$ is the set of all possible CARTs.
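The additive structure above can be sketched in a few lines of Python. The stub "trees" below are made-up threshold functions used purely for illustration, not real CARTs:

```python
# Toy illustration: an additive ensemble sums the scores of K individual
# trees to get the final prediction score, y_hat = sum_k f_k(x).
trees = [
    lambda x: 0.5 if x > 2 else 0.0,   # f_1
    lambda x: 0.3 if x > 5 else 0.0,   # f_2
    lambda x: -0.2 if x > 8 else 0.0,  # f_3
]

def ensemble_score(x):
    """Final prediction score: the sum of the individual tree scores."""
    return sum(f(x) for f in trees)

print([ensemble_score(x) for x in (1, 4, 6, 9)])
```

Each gradient boosting round adds one more $f_k$ to this sum, chosen to reduce the remaining error.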
The objective (loss) function is
\begin{equation*} obj(\theta) = \sum_{i} l(y_i, \hat y_i) + \sum_{k=1}^K \Omega(f_k) \end{equation*}
XGBoost and LightGBM are packages that belong to the family of gradient-boosted decision trees (GBDTs).
XGBoost stands for Extreme Gradient Boosting and is an implementation of gradient-boosted decision trees designed for speed and performance. Boosting is an ensemble technique where new models are added to correct the errors made by existing models; models are added sequentially until no further improvement can be made.
XGBoost is also known as the regularised version of GBM. The framework includes built-in L1 and L2 regularisation, which helps prevent the model from overfitting.
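For reference, XGBoost's complexity penalty $\Omega$ per tree takes the following common form (per the XGBoost documentation; $T$ is the number of leaves, $w_j$ the leaf weights, and $\gamma$, $\lambda$, $\alpha$ the corresponding hyperparameters):
\begin{equation*} \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j| \end{equation*}
In our parameter grids, reg_lambda controls the L2 term ($\lambda$) and reg_alpha the L1 term ($\alpha$).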
Light Gradient Boosting Machine, or LightGBM, is a highly efficient gradient-boosted decision tree algorithm. It is similar to XGBoost but differs in how it grows trees: LightGBM constructs trees leaf-wise in a best-first order, which tends to achieve lower loss than level-wise growth.
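The leaf-wise, best-first order can be illustrated with a toy priority-queue sketch. This shows only the growth order idea, not LightGBM's actual implementation, and the gain numbers are invented:

```python
import heapq

def grow_leaf_wise(initial_gain, child_gains, max_leaves):
    """Repeatedly split the current leaf with the largest gain (best-first),
    stopping once max_leaves is reached. Returns the order leaves are split in."""
    heap = [(-initial_gain, 0)]          # max-heap via negated gains; (gain, leaf id)
    n_leaves, next_id, split_order = 1, 1, []
    while heap and n_leaves < max_leaves:
        neg_gain, leaf = heapq.heappop(heap)   # leaf with the best split gain
        split_order.append(leaf)
        for g in child_gains.get(leaf, []):    # its children become candidate leaves
            heapq.heappush(heap, (-g, next_id))
            next_id += 1
        n_leaves += 1                          # one split: a leaf becomes two
    return split_order

# Splitting leaf 0 creates leaves 1 and 2; leaf 2 has the higher gain, so it
# is split before leaf 1 even though both sit at the same depth.
print(grow_leaf_wise(1.0, {0: [0.2, 0.8], 2: [0.1, 0.05]}, max_leaves=3))  # -> [0, 2]
```

Level-wise growth (XGBoost's default) would instead split every leaf at a given depth before moving deeper; leaf-wise growth spends the leaf budget where the loss reduction is largest, which is why num_leaves is the key capacity parameter in our LightGBM grids.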
ACCURACY SCORE
Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition:
\begin{equation*} Accuracy = \frac{Number\; of\; Correct\; Predictions}{Total\; Number\; of\; Predictions} \end{equation*}
ROC and AUC Score
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate
False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
\begin{equation*} TPR = \frac{TP}{TP + FN} \end{equation*}
False Positive Rate (FPR) is defined as follows:
\begin{equation*} FPR = \frac{FP}{FP + TN} \end{equation*}
AUC stands for "Area under the ROC Curve."
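The definitions above can be checked on a tiny hand-made example. The labels and predictions below are invented purely for illustration:

```python
# Toy example computing accuracy, TPR, and FPR directly from their definitions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / len(y_true)  # correct predictions / all predictions
tpr = tp / (tp + fn)                # true positive rate (recall)
fpr = fp / (fp + tn)                # false positive rate

print(accuracy, tpr, fpr)  # -> 0.75 0.75 0.25
```

The AUC summarizes the trade-off between TPR and FPR across all classification thresholds, which is why we use roc_auc as the grid search scoring metric below.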
After this background, we develop functions to perform the grid search. Although a single function could handle all of the case studies, we divided the grid search work among team members, so several variants of the function were developed.
# A Function to execute the grid search and record the results.
def ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
# Create Exp Logbook
explog = pd.DataFrame(columns= ['Model', 'Train Accuracy Score', 'Train AUC Score', 'Test Accuracy Score',
'Test AUC Score','Training Time', 'Experiments', 'Description'])
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train , stratify = y_train,
test_size = 0.2, random_state = 42)
# Create a list of classifiers for our grid search experiment
classifiers = [
('XGBoost', XGBClassifier(random_state = 42)),
('LightGBM', LGBMClassifier(random_state=42)),
]
# Arrange grid search parameters for each classifier
params_grid = {
'XGBoost': {
'n_estimators' : [500, 1000],
'eta': [0.01, 0.05, 0.1],
'min_child_weight' : [20, 35],
'colsample_bytree': [0.8, 0.95],
# 'reg_alpha' : [0.01, 0.05, 0.1],
# 'reg_lambda' : [0.01, 0.05, 0.1],
'max_depth' : [3,6,8],
'verbosity': [0],
'verbose_eval': [False],
},
'LightGBM': {
'n_estimators' : [5000, 10000],
'boosting_type': ['gbdt'],
'num_leaves': [30, 35],
'learning_rate': [0.01, 0.05, 0.1],
'colsample_bytree': [0.8, 0.95 ],
'is_unbalance': [True],
'reg_alpha' : [0.01, 0.02, 0.05],
'reg_lambda' : [0.01, 0.02, 0.05],
'min_split_gain' :[0.01, 0.02, 0.05],
},
}
fit_parms_grid = {
'XGBoost': {
"early_stopping_rounds":50,
"eval_metric" : 'auc',
"eval_set" : [(X_val,y_val)],
},
'LightGBM': {
"early_stopping_rounds":50,
"eval_metric" : 'auc',
"eval_set" : [(X_val,y_val)],
'eval_names': ['valid'],
'verbose': -1,
}
}
for (name, classifier) in classifiers:
i += 1
# Print classifier and parameters
print('****** START',prefix, name,'*****')
parameters = params_grid[name]
fit_parameter = fit_parms_grid[name]
print("Parameters:")
for p in sorted(parameters.keys()):
print("\t"+str(p)+": "+ str(parameters[p]))
total_exp = len(ParameterGrid(parameters))*5
# generate the pipeline
full_pipeline_with_predictor = Pipeline([
# ("preparation", data_pipeline),
("predictor", classifier)
])
# Execute the grid search
params = {}
for p in parameters.keys():
pipe_key = 'predictor__'+str(p)
params[pipe_key] = parameters[p]
fit_params = {}
for f in fit_parameter.keys():
fit_key = 'predictor__'+str(f)
fit_params[fit_key] = fit_parameter[f]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='roc_auc', cv=5,
n_jobs=n_jobs, verbose=verbose)
grid_search.fit(X_train, y_train, **fit_params)
# # Best estimator score
# best_train = pct(grid_search.best_score_)
# Best estimator fitting time
start = time()
grid_search.best_estimator_.fit(X_train, y_train)
Train_time = round(time() - start, 4)
# Best estimator prediction time
# y_pred_train = grid_search.best_estimator_.predict(X_train)
# y_pred = grid_search.best_estimator_.predict(X_test)
accuracy_train = round(accuracy_score(y_train,grid_search.best_estimator_.predict(X_train)), 3)
roc_train = round(roc_auc_score(y_train, grid_search.best_estimator_.predict_proba(X_train)[:, 1]), 3)
accuracy_test = round(accuracy_score(y_test,grid_search.best_estimator_.predict(X_test)), 3)
roc_test = round(roc_auc_score(y_test, grid_search.best_estimator_.predict_proba(X_test)[:, 1]), 3)
print('*'*40)
print('\n')
title = name + " - Normalized Confusion Matrix"
disp = plot_confusion_matrix(grid_search.best_estimator_,X_test, y_test, normalize='true')
disp.ax_.set_title(title)
plt.show()
print('-'*40)
print('\n')
disp2 = plot_roc_curve(grid_search.best_estimator_, X_test, y_test, )
disp2.ax_.set_title ("ROC curve - " + name)
plt.show()
print('*'*40)
print('\n')
# Collect the best parameters found by the grid search
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
param_dump = []
for param_name in sorted(params.keys()):
param_dump.append((param_name, best_parameters[param_name]))
print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
print("****** FINISH",prefix,name," *****")
print("")
print("-"*40)
print("*"*40)
print("-"*40)
# Record the results
explog.loc[len(explog)] = [prefix+name, accuracy_train, roc_train,
accuracy_test, roc_test, Train_time, total_exp,
json.dumps(param_dump)]
sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ')
display(explog)
explog.to_csv(sttime + 'experiment_log.csv', index = False)
# A Function to execute the grid search and record the results.
def SMOTE_ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
# Create Exp Logbook
explog = pd.DataFrame(columns= ['Model', 'Train Accuracy Score', 'Train AUC Score', 'Test Accuracy Score',
'Test AUC Score','Training Time', 'Experiments', 'Description'])
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train , stratify = y_train,
test_size = 0.2, random_state = 42)
# Create a list of classifiers for our grid search experiment
classifiers = [
('XGBoost', XGBClassifier(random_state = 42)),
('LightGBM', LGBMClassifier(random_state=42)),
]
# Arrange grid search parameters for each classifier
params_grid = {
'XGBoost': {
'n_estimators' : [500, 1000],
'eta': [0.01, 0.05, 0.1],
'min_child_weight' : [20, 35],
'colsample_bytree': [0.8],
'reg_alpha' : [0.01],
'reg_lambda' : [0.05],
'max_depth' : [6,8],
'verbosity': [0],
'verbose_eval': [False],
},
'LightGBM': {
'n_estimators' : [5000, 10000],
'boosting_type': ['gbdt'],
'num_leaves': [30, 35],
'learning_rate': [0.01, 0.05, 0.1],
'colsample_bytree': [0.8, 0.95 ],
'is_unbalance': [True],
'reg_alpha' : [0.01, 0.02, 0.05],
'reg_lambda' : [0.01, 0.02, 0.05],
'min_split_gain' :[0.01, 0.02, 0.05],
},
}
fit_parms_grid = {
'XGBoost': {
"early_stopping_rounds":50,
"eval_metric" : 'auc',
"eval_set" : [(X_val,y_val)],
},
'LightGBM': {
"early_stopping_rounds":50,
"eval_metric" : 'auc',
"eval_set" : [(X_val,y_val)],
'eval_names': ['valid'],
'verbose': -1,
}
}
for (name, classifier) in classifiers:
i += 1
# Print classifier and parameters
print('****** START',prefix, name,'*****')
parameters = params_grid[name]
fit_parameter = fit_parms_grid[name]
total_exp = len(ParameterGrid(parameters))*5
# generate the pipeline
full_pipeline_with_predictor = imbpipe([
('sampling', SMOTE(random_state=42)),
("predictor", classifier)
])
# Execute the grid search
params = {'sampling__sampling_strategy' : [0.25,0.5,'auto']}
for p in parameters.keys():
pipe_key = 'predictor__'+str(p)
params[pipe_key] = parameters[p]
print("Parameters:")
for p in sorted(params.keys()):
print("\t"+str(p)+": "+ str(params[p]))
fit_params = {}
for f in fit_parameter.keys():
fit_key = 'predictor__'+str(f)
fit_params[fit_key] = fit_parameter[f]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='roc_auc', cv=5,
n_jobs=n_jobs, verbose=verbose)
grid_search.fit(X_train, y_train, **fit_params)
# # Best estimator score
# best_train = pct(grid_search.best_score_)
# Best estimator fitting time
start = time()
grid_search.best_estimator_.fit(X_train, y_train)
Train_time = round(time() - start, 4)
# Best estimator prediction time
# y_pred_train = grid_search.best_estimator_.predict(X_train)
# y_pred = grid_search.best_estimator_.predict(X_test)
accuracy_train = round(accuracy_score(y_train,grid_search.best_estimator_.predict(X_train)), 3)
roc_train = round(roc_auc_score(y_train, grid_search.best_estimator_.predict_proba(X_train)[:, 1]), 3)
accuracy_test = round(accuracy_score(y_test,grid_search.best_estimator_.predict(X_test)), 3)
roc_test = round(roc_auc_score(y_test, grid_search.best_estimator_.predict_proba(X_test)[:, 1]), 3)
print('*'*40)
print('\n')
title = name + " - Normalized Confusion Matrix"
disp = plot_confusion_matrix(grid_search.best_estimator_,X_test, y_test, normalize='true')
disp.ax_.set_title(title)
plt.show()
print('-'*40)
print('\n')
disp2 = plot_roc_curve(grid_search.best_estimator_, X_test, y_test, )
disp2.ax_.set_title ("ROC curve - " + name)
plt.show()
print('*'*40)
print('\n')
# Collect the best parameters found by the grid search
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
param_dump = []
for param_name in sorted(params.keys()):
param_dump.append((param_name, best_parameters[param_name]))
print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
print("****** FINISH",prefix,name," *****")
print("")
print("-"*40)
print("*"*40)
print("-"*40)
# Record the results
explog.loc[len(explog)] = [prefix+name, accuracy_train, roc_train,
accuracy_test, roc_test, Train_time, total_exp,
json.dumps(param_dump)]
sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ')
display(explog)
explog.to_csv(sttime + 'experiment_log.csv', index = False)
# A Function to execute the grid search and record the results.
def PCA_ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
# Create Exp Logbook
explog = pd.DataFrame(columns= ['Model', 'Train Accuracy Score', 'Train AUC Score', 'Test Accuracy Score',
'Test AUC Score','Training Time', 'Experiments', 'Description'])
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train , stratify = y_train,
test_size = 0.2, random_state = 42)
# Create a list of classifiers for our grid search experiment
classifiers = [
('XGBoost', XGBClassifier(random_state = 42)),
('LightGBM', LGBMClassifier(random_state=42)),
]
# Arrange grid search parameters for each classifier
params_grid = {
'XGBoost': {
'n_estimators' : [500, 1000],
'eta': [0.01, 0.05, 0.1],
'min_child_weight' : [20, 35],
'colsample_bytree': [0.8, 0.95],
# 'reg_alpha' : [0.01, 0.05, 0.1],
# 'reg_lambda' : [0.01, 0.05, 0.1],
'max_depth' : [3,6,8],
'verbosity': [0],
'verbose_eval': [False],
},
'LightGBM': {
'n_estimators' : [5000, 10000],
'boosting_type': ['gbdt'],
'num_leaves': [30, 35],
'learning_rate': [0.01, 0.05, 0.1],
'colsample_bytree': [0.8, 0.95 ],
'is_unbalance': [True],
'reg_alpha' : [0.01, 0.02, 0.05],
'reg_lambda' : [0.01, 0.02, 0.05],
'min_split_gain' :[0.01, 0.02, 0.05],
},
}
fit_parms_grid = {
'XGBoost': {
"early_stopping_rounds":50,
"eval_metric" : 'auc',
"eval_set" : [(X_val,y_val)],
},
'LightGBM': {
"early_stopping_rounds":50,
"eval_metric" : 'auc',
"eval_set" : [(X_val,y_val)],
'eval_names': ['valid'],
'verbose': -1,
}
}
for (name, classifier) in classifiers:
i += 1
# Print classifier and parameters
print('****** START',prefix, name,'*****')
parameters = params_grid[name]
fit_parameter = fit_parms_grid[name]
total_exp = len(ParameterGrid(parameters))*5
# generate the pipeline
full_pipeline_with_predictor = Pipeline([
('pca', PCA()),
("predictor", classifier)
])
# Execute the grid search
params = {'pca__n_components' : [0.9,0.95,0.99]}
for p in parameters.keys():
pipe_key = 'predictor__'+str(p)
params[pipe_key] = parameters[p]
print("Parameters:")
for p in sorted(params.keys()):
print("\t"+str(p)+": "+ str(params[p]))
fit_params = {}
for f in fit_parameter.keys():
fit_key = 'predictor__'+str(f)
fit_params[fit_key] = fit_parameter[f]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='roc_auc', cv=5,
n_jobs=n_jobs, verbose=verbose)
grid_search.fit(X_train, y_train, **fit_params)
# # Best estimator score
# best_train = pct(grid_search.best_score_)
# Best estimator fitting time
start = time()
grid_search.best_estimator_.fit(X_train, y_train)
Train_time = round(time() - start, 4)
# Best estimator prediction time
# y_pred_train = grid_search.best_estimator_.predict(X_train)
# y_pred = grid_search.best_estimator_.predict(X_test)
accuracy_train = round(accuracy_score(y_train,grid_search.best_estimator_.predict(X_train)), 3)
roc_train = round(roc_auc_score(y_train, grid_search.best_estimator_.predict_proba(X_train)[:, 1]), 3)
accuracy_test = round(accuracy_score(y_test,grid_search.best_estimator_.predict(X_test)), 3)
roc_test = round(roc_auc_score(y_test, grid_search.best_estimator_.predict_proba(X_test)[:, 1]), 3)
print('*'*40)
print('\n')
title = name + " - Normalized Confusion Matrix"
disp = plot_confusion_matrix(grid_search.best_estimator_,X_test, y_test, normalize='true')
disp.ax_.set_title(title)
plt.show()
print('-'*40)
print('\n')
disp2 = plot_roc_curve(grid_search.best_estimator_, X_test, y_test, )
disp2.ax_.set_title ("ROC curve - " + name)
plt.show()
print('*'*40)
print('\n')
# Collect the best parameters found by the grid search
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
param_dump = []
for param_name in sorted(params.keys()):
param_dump.append((param_name, best_parameters[param_name]))
print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
print("****** FINISH",prefix,name," *****")
print("")
print("-"*40)
print("*"*40)
print("-"*40)
# Record the results
explog.loc[len(explog)] = [prefix+name, accuracy_train, roc_train,
accuracy_test, roc_test, Train_time, total_exp,
json.dumps(param_dump)]
sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ')
display(explog)
explog.to_csv(sttime + 'experiment_log.csv', index = False)
class Custom_scaling(TransformerMixin, BaseEstimator):
'''A template for a custom transformer.'''
def __init__(self):
self.scale = StandardScaler()
# self.ohe = OneHotEncoder(handle_unknown='ignore'))
## Column Lists
self.cat_list = []
self.dis_num_list = []
self.num_list = []
def fit(self, X, y=None):
df = pd.DataFrame(X)
self.cat_list, self.dis_num_list, self.num_list = self._feature_type_split(df)
self.scale.fit(df[self.num_list])
return self
def transform(self, X):
# transform X via code or additional methods
df = pd.DataFrame(X)
# Continous Numerical List
df[self.num_list] = self.scale.transform(df[self.num_list])
print("Scaling completed............")
return df.values
def _feature_type_split(self, X):
cat_list = []
dis_num_list = []
num_list = []
df = X.copy()
# df= df.drop('SK_ID_CURR', axis =1)
for i in df.columns.tolist():
if i != 'TARGET':
if df[i].dtype == 'object':
cat_list.append(i)
elif df[i].nunique() < 25:
dis_num_list.append(i)
else:
num_list.append(i)
print('Columns split completed............')
return cat_list, dis_num_list, num_list
# A Function to execute the grid search and record the results.
def Scaled_ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
# Create Exp Logbook
explog = pd.DataFrame(columns= ['Model', 'Train Accuracy Score', 'Train AUC Score', 'Test Accuracy Score',
'Test AUC Score','Training Time', 'Experiments', 'Description'])
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train , stratify = y_train,
test_size = 0.2, random_state = 42)
# Create a list of classifiers for our grid search experiment
classifiers = [
('XGBoost', XGBClassifier(random_state = 42)),
('LightGBM', LGBMClassifier(random_state=42)),
]
# Arrange grid search parameters for each classifier
params_grid = {
'XGBoost': {
'n_estimators' : [500, 1000],
'eta': [0.01, 0.05, 0.1],
'min_child_weight' : [20, 35],
'colsample_bytree': [0.8, 0.95],
# 'reg_alpha' : [0.01, 0.05, 0.1],
# 'reg_lambda' : [0.01, 0.05, 0.1],
'max_depth' : [3,6,8],
'verbosity': [0],
'verbose_eval': [False],
},
'LightGBM': {
'n_estimators' : [5000, 10000],
'boosting_type': ['gbdt'],
'num_leaves': [30, 35],
'learning_rate': [0.01, 0.05, 0.1],
'colsample_bytree': [0.8, 0.95 ],
'is_unbalance': [True],
'reg_alpha' : [0.01, 0.02, 0.05],
'reg_lambda' : [0.01, 0.02, 0.05],
'min_split_gain' :[0.01, 0.02, 0.05],
},
}
fit_parms_grid = {
'XGBoost': {
"early_stopping_rounds":50,
"eval_metric" : 'auc',
"eval_set" : [(X_val,y_val)],
},
'LightGBM': {
"early_stopping_rounds":50,
"eval_metric" : 'auc',
"eval_set" : [(X_val,y_val)],
'eval_names': ['valid'],
'verbose': -1,
}
}
for (name, classifier) in classifiers:
i += 1
# Print classifier and parameters
print('****** START',prefix, name,'*****')
parameters = params_grid[name]
fit_parameter = fit_parms_grid[name]
total_exp = len(ParameterGrid(parameters))*5
# generate the pipeline
full_pipeline_with_predictor = Pipeline([
('scale', Custom_scaling()),
("predictor", classifier)
])
# Execute the grid search
params = {}
for p in parameters.keys():
pipe_key = 'predictor__'+str(p)
params[pipe_key] = parameters[p]
print("Parameters:")
for p in sorted(params.keys()):
print("\t"+str(p)+": "+ str(params[p]))
fit_params = {}
for f in fit_parameter.keys():
fit_key = 'predictor__'+str(f)
fit_params[fit_key] = fit_parameter[f]
grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='roc_auc', cv=5,
n_jobs=n_jobs, verbose=verbose)
grid_search.fit(X_train, y_train, **fit_params)
# # Best estimator score
# best_train = pct(grid_search.best_score_)
# Best estimator fitting time
start = time()
grid_search.best_estimator_.fit(X_train, y_train)
Train_time = round(time() - start, 4)
# Best estimator prediction time
# y_pred_train = grid_search.best_estimator_.predict(X_train)
# y_pred = grid_search.best_estimator_.predict(X_test)
accuracy_train = round(accuracy_score(y_train,grid_search.best_estimator_.predict(X_train)), 3)
roc_train = round(roc_auc_score(y_train, grid_search.best_estimator_.predict_proba(X_train)[:, 1]), 3)
accuracy_test = round(accuracy_score(y_test,grid_search.best_estimator_.predict(X_test)), 3)
roc_test = round(roc_auc_score(y_test, grid_search.best_estimator_.predict_proba(X_test)[:, 1]), 3)
print('*'*40)
print('\n')
title = name + " - Normalized Confusion Matrix"
disp = plot_confusion_matrix(grid_search.best_estimator_,X_test, y_test, normalize='true')
disp.ax_.set_title(title)
plt.show()
print('-'*40)
print('\n')
disp2 = plot_roc_curve(grid_search.best_estimator_, X_test, y_test, )
disp2.ax_.set_title ("ROC curve - " + name)
plt.show()
print('*'*40)
print('\n')
# Collect the best parameters found by the grid search
print("Best Parameters:")
best_parameters = grid_search.best_estimator_.get_params()
param_dump = []
for param_name in sorted(params.keys()):
param_dump.append((param_name, best_parameters[param_name]))
print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
print("****** FINISH",prefix,name," *****")
print("")
print("-"*40)
print("*"*40)
print("-"*40)
# Record the results
explog.loc[len(explog)] = [prefix+name, accuracy_train, roc_train,
accuracy_test, roc_test, Train_time, total_exp,
json.dumps(param_dump)]
sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ')
display(explog)
explog.to_csv(sttime + 'experiment_log.csv', index = False)
df_train['TARGET'].value_counts(normalize = True).round(2).plot.pie(autopct='%1.1f%%')
%%time
# This might take a while
if __name__ == "__main__":
ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Unbalanced:", n_jobs=-1,verbose=1, )
def manual_balancing(df, ratio = 0.25):
    """Downsample the majority class so the minority class makes up `ratio` of the result."""
    assert (ratio <= 0.5), "ratio must be less than or equal to 0.5"
    target1 = df[df['TARGET'] == 1]   # minority class (defaults)
    target0 = df[df['TARGET'] == 0]   # majority class
    # For minority fraction r with m minority rows, the majority count n must
    # satisfy m / (m + n) = r, i.e. n = m * (1 - r) / r = m / ratio_eq.
    ratio_eq = ratio / (1 - ratio)
    n = int(min(target1.shape[0] / ratio_eq, target0.shape[0]))
    majority_df = target0.sample(n, random_state = 42)
    # Shuffle the combined frame so the classes are interleaved
    df_balanced = pd.concat([target1, majority_df]).sample(frac=1).reset_index(drop = True)
    return df_balanced
df_balanced = manual_balancing(df_train_reduced, 0.25)
X_bal = df_balanced.drop('TARGET', axis = 1).values
y_bal = df_balanced['TARGET'].values
df_balanced['TARGET'].value_counts(normalize = True).round(2).plot.pie(autopct='%1.1f%%')
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal , stratify = y_bal,
test_size = 0.3, random_state = 42)
%%time
# This might take a while
if __name__ == "__main__":
    ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Manual Balanced 25%:", n_jobs=-1,verbose=1)
df_balanced = manual_balancing(df_train_reduced, ratio=0.33)
X_bal = df_balanced.drop('TARGET', axis = 1).values
y_bal = df_balanced['TARGET'].values
df_balanced['TARGET'].value_counts(normalize = True).round(2).plot.pie(autopct='%1.1f%%')
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal , stratify = y_bal,
test_size = 0.3, random_state = 42)
%%time
# This might take a while
if __name__ == "__main__":
    ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Manual Balanced 33%:", n_jobs=-1, verbose=1)
df_balanced = manual_balancing(df_train, ratio=0.5)
X_bal = df_balanced.drop('TARGET', axis = 1).values
y_bal = df_balanced['TARGET'].values
df_balanced['TARGET'].value_counts(normalize = True).round(2).plot.pie(autopct='%1.1f%%')
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal , stratify = y_bal,
test_size = 0.3, random_state = 42)
%%time
# This might take a while
if __name__ == "__main__":
    ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Manual Balanced 1:1", n_jobs=-1, verbose=1)
X = df_train_reduced.drop('TARGET', axis = 1).values
y = df_train_reduced['TARGET'].values
X_train, X_test, y_train, y_test = train_test_split(X, y , stratify = y,
test_size = 0.3, random_state = 42)
%%time
from sklearn.preprocessing import StandardScaler
# This might take a while
if __name__ == "__main__":
    Scaled_ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model Scaled:", n_jobs=-1, verbose=1)
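`Scaled_ConductGridSearch` is defined earlier in the notebook and not shown here. A minimal sketch of the pattern it presumably follows: the scaler must sit inside the pipeline that the grid search cross-validates, so each fold's scaling statistics are fit only on that fold's training data and never leak from the validation fold. `LogisticRegression` and the synthetic data below are stand-ins for illustration, not the models or data used in our experiments.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data as a stand-in for the HCDR features
X, y = make_classification(n_samples=500, n_features=20, weights=[0.9],
                           random_state=42)

# Scaling lives inside the pipeline: GridSearchCV refits it per CV fold,
# so validation rows never influence the mean/std used for scaling.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]},
                    scoring='roc_auc', cv=3, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```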
%%time
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
# This might take a while
if __name__ == "__main__":
    PCA_ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model PCA:", n_jobs=-1, verbose=1)
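`PCA_ConductGridSearch` is likewise defined earlier in the notebook. A minimal stand-alone sketch of the pipeline it presumably builds (scale, project with PCA, then classify); the component count and synthetic data here are illustrative assumptions. Checking the retained explained variance is a quick way to see why PCA can hurt: components are chosen for variance, not for discriminative power, which is consistent with the low PCA score we observed.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, n_informative=10,
                           random_state=42)

# Scale first so PCA directions are not dominated by high-variance columns,
# then project to 10 components before classifying.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('pca', PCA(n_components=10)),
                 ('clf', GradientBoostingClassifier(random_state=42))])
pipe.fit(X, y)

# Fraction of total variance the 10 components retain
evr = pipe.named_steps['pca'].explained_variance_ratio_.sum()
print(round(evr, 3))
```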
X = df_train_reduced.drop('TARGET', axis = 1).values
y = df_train_reduced['TARGET'].values
X_train, X_test, y_train, y_test = train_test_split(X, y , stratify = y,
test_size = 0.3, random_state = 42)
%%time
# This might take a while
from imblearn.pipeline import Pipeline as imbpipe
from imblearn.over_sampling import SMOTE
if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected block.
    # Find the best parameters for both the resampler and the classifier;
    # n_jobs=-1 dispatches the computation on all CPUs of the machine.
    # Recent versions of GridSearchCV default to 5-fold cross-validation, and
    # use a stratified 5-fold split when a classifier is passed.
    SMOTE_ConductGridSearch(X_train, y_train, X_test, y_test, 0, "Best Model SMOTE:", n_jobs=-1, verbose=1)
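`SMOTE_ConductGridSearch` wraps SMOTE and the classifier in an imblearn pipeline, as the imports above suggest. To make the resampling step concrete, here is a minimal numpy sketch of what SMOTE itself does: synthesize minority points by interpolating between a minority sample and one of its k nearest minority neighbours. This is an illustration of the idea, not the imblearn implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=42):
    # For each synthetic point: pick a minority sample, pick one of its
    # k nearest minority neighbours, and place the new point a random
    # fraction of the way along the segment between them.
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the point itself
    base = rng.integers(0, len(X_min), n_new)
    nbr = idx[base, rng.integers(1, k + 1, n_new)]
    gap = rng.random((n_new, 1))           # interpolation fraction in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

X_min = np.random.default_rng(0).normal(size=(50, 4))   # toy minority class
X_syn = smote_sketch(X_min, n_new=150)
print(X_syn.shape)  # (150, 4)
```

Because every synthetic point is a convex combination of two real minority points, SMOTE never extrapolates outside the minority class's bounding region.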
# display(explog)
In total, more than 18,000 experiments were performed. The best result came from the synthetically balanced dataset with a LightGBM model, which achieved a Kaggle score of 0.794; training time for the best model was 138 seconds.
The second-best model was developed using the unbalanced dataset, with a Kaggle score of 0.792 and a training time of 233 seconds. It was also observed that scaling and PCA had an adverse effect on the prediction results.
The lowest-performing model used PCA, with a Kaggle score just above 0.6.
We recommend using the unbalanced dataset with the XGBoost model, as it also represents the real-world scenario.
The best-performing features are:
Term of Loan (Engineered Feature)
Mean of External Sources (Engineered Feature)
Length of Employment (Given)
External Source Score (Given)
Future Monthly Installment Ratio (Engineered Feature)
Annuity Amount to Income Ratio (Engineered Feature)
Several features that look important at first sight were given zero importance by the models when predicting the outcome. Some of the features given zero importance are:
Industry where people work
Requirement of certain documents
Whether the person is unemployed or a student, as long as they have income
Whether the person is a first-time applicant or not
explog
All of the above models and their results were submitted to Kaggle. The best Kaggle result comes from the LightGBM model using the synthetically balanced dataset. Below is a snapshot of all submissions on Kaggle.
At the end of Phase 2, more than 25,000 experiments had been performed to evaluate the best model. Although the SMOTE synthetically balanced dataset generates the best Kaggle score of 0.794, the second-best model, trained on the unbalanced dataset, is recommended: it represents the real-world scenario in which one class is clearly in the majority.
Further, these results can be evaluated using neural network techniques and compared against the traditional classifiers. The next phase will focus on developing a neural network model and comparing its results with the traditional classifier results.